!jupyter nbconvert --to html '/content/drive/MyDrive/MIT Data Science - Python Course/Capstone/(Full) ULKUCU_Capstone_Project_Reference_Notebook_Full_Code_Used_Cars_Price_Prediction.ipynb'
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT Data Science - Python Course/Capstone/ULKUCU_Capstone_Project_Reference_Notebook_Full_Code_Used_Cars_Price_Prediction.ipynb to html
[NbConvertApp] Writing 5529576 bytes to /content/drive/MyDrive/MIT Data Science - Python Course/Capstone/ULKUCU_Capstone_Project_Reference_Notebook_Full_Code_Used_Cars_Price_Prediction.html
S.No. : Serial Number
Name : Name of the car which includes Brand name and Model name
Location : The location in which the car is being sold or is available for purchase (Cities)
Year : Manufacturing year of the car
Kilometers_Driven : The total distance driven by the previous owner(s), in kilometers
Fuel_Type : The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
Transmission : The type of transmission used by the car (Automatic / Manual)
Owner : Type of ownership
Mileage : The standard mileage offered by the car company in kmpl or km/kg
Engine : The displacement volume of the engine in CC
Power : The maximum power of the engine in bhp
Seats : The number of seats in the car
New_Price : The price of a new car of the same model in INR 100,000
Price : The price of the used car in INR 100,000 (Target Variable)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor
# To encode categorical variables
from sklearn.preprocessing import LabelEncoder
# For tuning the model
from sklearn.model_selection import GridSearchCV
# To check model performance
from sklearn.metrics import make_scorer,mean_squared_error, r2_score, mean_absolute_error
# let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
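Since `make_scorer` and `mean_squared_error` are imported above for checking model performance, here is a minimal sketch of how they are typically combined into an RMSE scorer for `GridSearchCV`; the `rmse` helper name is my own, not from this notebook.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large errors more than MAE does
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False tells GridSearchCV that a LOWER RMSE is better
rmse_scorer = make_scorer(rmse, greater_is_better=False)
```

This scorer can then be passed as `scoring=rmse_scorer` to `GridSearchCV` or `cross_val_score`.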
data = pd.read_csv("/content/drive/MyDrive/MIT Data Science - Python Course/Capstone/used_cars.csv")
data # see first five & last five rows
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN |
| 7249 | 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN |
| 7250 | 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN |
| 7251 | 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN |
| 7252 | 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN |
7253 rows × 14 columns
INITIAL Data Loading Observations:
7,253 rows -- relatively large
'S.No.' - unique identifier → drop
A Couple Notes on a Couple Columns
The car manufacturer is embedded in the 'Name' column, so I may extract it in order to group by manufacturer, since brand is definitely a factor in pricing and in a customer's decision to buy.
NOTE: The variable named 'Mileage' is not in miles. After some research, I found that if the 'Fuel_Type' is CNG (Compressed Natural Gas), then the value in the 'Mileage' column is measured in km/kg (Kilometers per Kilogram of CNG).
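To keep those mixed units straight during analysis, one lightweight option is to tag each row with its mileage unit. A minimal sketch on toy rows (the `Mileage_Unit` column is my own addition, not part of the dataset):

```python
import pandas as pd
import numpy as np

# Toy rows mirroring the dataset's columns (values made up)
df = pd.DataFrame({
    "Fuel_Type": ["CNG", "Diesel", "Petrol"],
    "Mileage":   [26.60, 19.67, 18.20],
})

# CNG mileage is measured in km/kg; liquid fuels are in kmpl
df["Mileage_Unit"] = np.where(df["Fuel_Type"] == "CNG", "km/kg", "kmpl")
print(df)
```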
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   S.No.              7253 non-null   int64
 1   Name               7253 non-null   object
 2   Location           7253 non-null   object
 3   Year               7253 non-null   int64
 4   Kilometers_Driven  7253 non-null   int64
 5   Fuel_Type          7253 non-null   object
 6   Transmission       7253 non-null   object
 7   Owner_Type         7253 non-null   object
 8   Mileage            7251 non-null   float64
 9   Engine             7207 non-null   float64
 10  Power              7078 non-null   float64
 11  Seats              7200 non-null   float64
 12  New_price          1006 non-null   float64
 13  Price              6019 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 793.4+ KB
Object Type Observations
# Writing a Bold function for use elsewhere instead of having to type '\033[1m' every time (which is hardly intuitive)
def bold_text(text):
bold_start = '\033[1m'
bold_end = '\033[0m'
return bold_start + text + bold_end
# Check for Uniques, Nulls & Duplicates
# Uniques in each column
unique_values = data.nunique()
print('Number of ' + bold_text('unique values') + ' in each column: \n', unique_values)
# Null values in each column
null_values = data.isnull().sum()
print('\n Number of ' + bold_text('null values') + ' in each column: \n', null_values)
# Duplicate rows
duplicate_rows = data.duplicated().sum()
print('\n Number of ' + bold_text('duplicate rows') + ':', duplicate_rows)
Number of unique values in each column:
 S.No.                7253
Name                 2041
Location               11
Year                   23
Kilometers_Driven    3660
Fuel_Type               5
Transmission            2
Owner_Type              4
Mileage               438
Engine                150
Power                 383
Seats                   8
New_price             625
Price                1373
dtype: int64

 Number of null values in each column:
 S.No.                   0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 46
Power                 175
Seats                  53
New_price            6247
Price                1234
dtype: int64

 Number of duplicate rows: 0
UNIQUE: 'S.No.' has 7,253 unique values (one per row), confirming it is just an identifier.
NULL: 'New_price' is missing in 6,247 of 7,253 rows (~86%), 'Price' in 1,234 rows; 'Mileage', 'Engine', 'Power' & 'Seats' have much smaller gaps.
DUPLICATES: No duplicate rows.
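Raw null counts are easier to judge as percentages (e.g. 'New_price' is missing in roughly 86% of rows). A minimal sketch on a toy frame, using `isnull().mean()`:

```python
import pandas as pd

# Toy frame: fraction of nulls per column, expressed as a percentage
df = pd.DataFrame({"Price": [1.0, None, 3.0, None], "Seats": [5.0, 5.0, None, 5.0]})
null_pct = df.isnull().mean().mul(100).round(1)
print(null_pct)  # Price 50.0, Seats 25.0
```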
Since 'Price' is the target variable, we'll separate it from the features when we split the data for training & testing, after some initial analyses.
Questions:
Look at Data after dropping Serial Number:
# Set a variable to drop (in case you wanted to do something else)
column_to_drop = 'S.No.' # Drop serial number since it's just an ID
# Check if the column exists in the DataFrame
if column_to_drop in data.columns:
# Drop the column if it exists
data.drop(columns=(column_to_drop), inplace=True)
data
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN |
| 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN |
| 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN |
| 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN |
| 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN |
7253 rows × 13 columns
# Statistical Analysis of Numerical Columns
data.describe()
| Year | Kilometers_Driven | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|
| count | 7253.000000 | 7.253000e+03 | 7251.000000 | 7207.000000 | 7078.000000 | 7200.000000 | 1006.000000 | 6019.000000 |
| mean | 2013.365366 | 5.869906e+04 | 18.141580 | 1616.573470 | 112.765214 | 5.280417 | 22.779692 | 9.479468 |
| std | 3.254421 | 8.442772e+04 | 4.562197 | 595.285137 | 53.493553 | 0.809277 | 27.759344 | 11.187917 |
| min | 1996.000000 | 1.710000e+02 | 0.000000 | 72.000000 | 34.200000 | 2.000000 | 3.910000 | 0.440000 |
| 25% | 2011.000000 | 3.400000e+04 | 15.170000 | 1198.000000 | 75.000000 | 5.000000 | 7.885000 | 3.500000 |
| 50% | 2014.000000 | 5.341600e+04 | 18.160000 | 1493.000000 | 94.000000 | 5.000000 | 11.570000 | 5.640000 |
| 75% | 2016.000000 | 7.300000e+04 | 21.100000 | 1968.000000 | 138.100000 | 5.000000 | 26.042500 | 9.950000 |
| max | 2019.000000 | 6.500000e+06 | 33.540000 | 5998.000000 | 616.000000 | 10.000000 | 375.000000 | 160.000000 |
Statistics Observations:
'Year', 'Kilometers_Driven', 'Mileage', 'Seats' = mean & median are quite close to each other --> generally more tightly distributed. This is especially true for 'Year' since the standard deviation is a tiny fraction of the mean.
'Engine' & 'Power' = mean & median are somewhat close to each other (~10-20% gap)
'New_price' & 'Price' = mean & median are quite far apart. One factor could be the extensive missing values.
There are also several columns which have extremely high maximums (ex: 'New_price'), which suggests significant skew.
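The mean-vs-median gap noted above is exactly what `Series.skew()` quantifies; a toy illustration of how one extreme value produces positive (right) skew:

```python
import pandas as pd

# A right-skewed toy sample: one extreme value drags the mean above the median
s = pd.Series([1, 2, 2, 3, 3, 4, 100])
print(s.mean(), s.median(), s.skew())  # mean >> median, positive skew
```

On the real data, `data.skew(numeric_only=True)` would give the same diagnostic per column.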
# Summary Statistics for Categorical variables
data.describe(include=['O'])
| Name | Location | Fuel_Type | Transmission | Owner_Type | |
|---|---|---|---|---|---|
| count | 7253 | 7253 | 7253 | 7253 | 7253 |
| unique | 2041 | 11 | 5 | 2 | 4 |
| top | Mahindra XUV500 W8 2WD | Mumbai | Diesel | Manual | First |
| freq | 55 | 949 | 3852 | 5204 | 5952 |
# Print the unique values in each categorical column
for column in data.select_dtypes(include=['object']).columns:
print(f"Unique values in {bold_text(column)}:")
print(data[column].value_counts())
print("\n")
Unique values in Name:
Name
Mahindra XUV500 W8 2WD                  55
Maruti Swift VDI                        49
Maruti Swift Dzire VDI                  42
Honda City 1.5 S MT                     39
Maruti Swift VDI BSIV                   37
                                        ..
Chevrolet Beat LT Option                 1
Skoda Rapid 1.6 MPI AT Elegance Plus     1
Ford EcoSport 1.5 TDCi Ambiente          1
Hyundai i10 Magna 1.1 iTech SE           1
Hyundai Elite i20 Magna Plus             1
Name: count, Length: 2041, dtype: int64

Unique values in Location:
Location
Mumbai        949
Hyderabad     876
Coimbatore    772
Kochi         772
Pune          765
Delhi         660
Kolkata       654
Chennai       591
Jaipur        499
Bangalore     440
Ahmedabad     275
Name: count, dtype: int64

Unique values in Fuel_Type:
Fuel_Type
Diesel      3852
Petrol      3325
CNG           62
LPG           12
Electric       2
Name: count, dtype: int64

Unique values in Transmission:
Transmission
Manual       5204
Automatic    2049
Name: count, dtype: int64

Unique values in Owner_Type:
Owner_Type
First             5952
Second            1152
Third              137
Fourth & Above      12
Name: count, dtype: int64
# Just showing the proportions for each specific car model ('Name') before removing that column
print(data['Name'].value_counts(1)) # the (1) normalizes the data to proportions
Name
Mahindra XUV500 W8 2WD 0.007583
Maruti Swift VDI 0.006756
Maruti Swift Dzire VDI 0.005791
Honda City 1.5 S MT 0.005377
Maruti Swift VDI BSIV 0.005101
...
Chevrolet Beat LT Option 0.000138
Skoda Rapid 1.6 MPI AT Elegance Plus 0.000138
Ford EcoSport 1.5 TDCi Ambiente 0.000138
Hyundai i10 Magna 1.1 iTech SE 0.000138
Hyundai Elite i20 Magna Plus 0.000138
Name: proportion, Length: 2041, dtype: float64
# Extract Manufacturer
def extract_manufacturer(car_name):
"""
Extracts the manufacturer from a car name.
Args:
car_name: The name of the car.
Returns:
The manufacturer of the car.
"""
# Split the car name into words.
words = car_name.split()
# Exception for "Land Rover" (two word manufacturer name)
if (words[0] == 'Land'):
manufacturer = words[0] + ' ' + words[1]
else:
# The first word is usually the manufacturer.
manufacturer = words[0]
# Return the manufacturer.
return manufacturer
### APPLY the function ###
# Apply the function to the 'Name' column.
data['Manufacturer'] = data['Name'].apply(extract_manufacturer)
# Check the new data w/ added Manufacturer column
data.sample(3)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2697 | Tata Indigo LS | Pune | 2006 | 160000 | Diesel | Manual | First | 17.0 | 1405.0 | 70.0 | 5.0 | NaN | 1.00 | Tata |
| 5171 | Honda City i DTEC VX | Kochi | 2017 | 65850 | Diesel | Manual | First | 25.1 | 1498.0 | 98.6 | 5.0 | NaN | 9.44 | Honda |
| 2118 | Mahindra XUV500 W10 2WD | Delhi | 2015 | 62000 | Diesel | Manual | First | 16.0 | 2179.0 | 140.0 | 7.0 | NaN | 10.25 | Mahindra |
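A self-contained spot-check of the extraction logic, with the function body restated so the snippet runs on its own:

```python
def extract_manufacturer(car_name):
    """Return the brand: the first word, or two words for 'Land Rover'."""
    words = car_name.split()
    if words[0] == 'Land':
        return words[0] + ' ' + words[1]
    return words[0]

print(extract_manufacturer('Maruti Wagon R LXI CNG'))   # Maruti
print(extract_manufacturer('Land Rover Range Rover'))   # Land Rover
```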
# List brand names
print(sorted(data['Manufacturer'].unique())) # "sorted()" alphabetizes the list
['Ambassador', 'Audi', 'BMW', 'Bentley', 'Chevrolet', 'Datsun', 'Fiat', 'Force', 'Ford', 'Hindustan', 'Honda', 'Hyundai', 'ISUZU', 'Isuzu', 'Jaguar', 'Jeep', 'Lamborghini', 'Land Rover', 'Mahindra', 'Maruti', 'Mercedes-Benz', 'Mini', 'Mitsubishi', 'Nissan', 'OpelCorsa', 'Porsche', 'Renault', 'Skoda', 'Smart', 'Tata', 'Toyota', 'Volkswagen', 'Volvo']
# Combine ISUZU into Isuzu since we only need the latter to match the spelling convention
data['Manufacturer'] = data['Manufacturer'].replace('ISUZU', 'Isuzu')
print(sorted(data['Manufacturer'].unique())) # "sorted()" alphabetizes the list
['Ambassador', 'Audi', 'BMW', 'Bentley', 'Chevrolet', 'Datsun', 'Fiat', 'Force', 'Ford', 'Hindustan', 'Honda', 'Hyundai', 'Isuzu', 'Jaguar', 'Jeep', 'Lamborghini', 'Land Rover', 'Mahindra', 'Maruti', 'Mercedes-Benz', 'Mini', 'Mitsubishi', 'Nissan', 'OpelCorsa', 'Porsche', 'Renault', 'Skoda', 'Smart', 'Tata', 'Toyota', 'Volkswagen', 'Volvo']
# Drop 'Name' since there are so many unique values & we just extracted the Manufacturer
column_to_drop = 'Name'
# Check if the column exists in the DataFrame to avoid error when rerunning specific cells while building/debugging code
if column_to_drop in data.columns:
# Drop the column if it exists
data.drop(columns=(column_to_drop), inplace=True)
# Defining a variable for string object columns (categorical)
string_columns = data.select_dtypes(include=['object']).columns.tolist()
print(string_columns) # print the list
['Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Manufacturer']
data.sample(3)
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2946 | Coimbatore | 2016 | 59711 | Diesel | Manual | First | 15.10 | 2179.0 | 140.0 | 7.0 | NaN | 13.28 | Mahindra |
| 4122 | Kochi | 2017 | 44887 | Diesel | Manual | First | 19.67 | 1582.0 | 126.2 | 5.0 | NaN | 12.13 | Hyundai |
| 4418 | Coimbatore | 2014 | 48997 | Petrol | Manual | Second | 18.00 | 1198.0 | 86.7 | 5.0 | NaN | 5.21 | Honda |
# Printing the proportion of occurrences of each unique value in each categorical column
for column in string_columns:
print(data[column].value_counts(1)) # the (1) normalizes the data to proportions
print("-" * 50)
Location
Mumbai        0.130842
Hyderabad     0.120778
Coimbatore    0.106439
Kochi         0.106439
Pune          0.105474
Delhi         0.090997
Kolkata       0.090170
Chennai       0.081484
Jaipur        0.068799
Bangalore     0.060665
Ahmedabad     0.037915
Name: proportion, dtype: float64
--------------------------------------------------
Fuel_Type
Diesel      0.531091
Petrol      0.458431
CNG         0.008548
LPG         0.001654
Electric    0.000276
Name: proportion, dtype: float64
--------------------------------------------------
Transmission
Manual       0.717496
Automatic    0.282504
Name: proportion, dtype: float64
--------------------------------------------------
Owner_Type
First             0.820626
Second            0.158831
Third             0.018889
Fourth & Above    0.001654
Name: proportion, dtype: float64
--------------------------------------------------
Manufacturer
Maruti           0.199090
Hyundai          0.184751
Honda            0.102440
Toyota           0.069902
Mercedes-Benz    0.052392
Volkswagen       0.051565
Ford             0.048394
Mahindra         0.045636
BMW              0.043017
Audi             0.039294
Tata             0.031435
Skoda            0.027851
Renault          0.023439
Chevrolet        0.020819
Nissan           0.016131
Land Rover       0.009238
Jaguar           0.006618
Fiat             0.005239
Mitsubishi       0.004963
Mini             0.004274
Volvo            0.003860
Porsche          0.002620
Jeep             0.002620
Datsun           0.002344
Isuzu            0.000689
Force            0.000414
Bentley          0.000276
Smart            0.000138
Ambassador       0.000138
Lamborghini      0.000138
Hindustan        0.000138
OpelCorsa        0.000138
Name: proportion, dtype: float64
--------------------------------------------------
Proportion Observations
The top two locations (Mumbai & Hyderabad) have about 12-13% of the data. The next five have about 9-10% and the least common (Ahmedabad) has about 3.8%. So it's fairly spread out.
Diesel & Petrol constitute 99% of the fuel types (to be expected). There are extremely few electric cars (just 2, ~0.03%). They are more expensive not only to own, but they also require a more substantial electric grid for charging.
Manual surpasses Automatic transmission by more than a 2:1 ratio. This is true in many countries outside the US (Americans generally do not drive stick shift).
82% of the cars are still with their original owner ("First"). This could reflect owners keeping their first car rather than selling it to finance another purchase.
Maruti & Hyundai make up about 40% of the data (almost 20% each).
num_cols = 3
adjustment = num_cols - 1
num_rows = (len(string_columns) + adjustment) // num_cols
axes = plt.subplots(num_rows, num_cols, figsize=(20, 10))[1]
for i, col in enumerate(string_columns):
ax = axes[i // num_cols, i % num_cols]
data[col].value_counts().plot(kind='pie', ax=ax, autopct='%1.1f%%')
ax.axis('off') # remove the axis label since that's the same as the title
# Set the title of the pie chart
ax.set_title(col)
for i in range(len(string_columns), num_rows * num_cols):
axes[i // num_cols, i % num_cols].set_visible(False)
# Display the plot
plt.tight_layout()
plt.show()
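The `adjustment = num_cols - 1` step above is the integer ceiling-division trick: `(n + num_cols - 1) // num_cols` gives the number of subplot rows needed without importing `math`. A quick check:

```python
import math

num_cols = 3
for n in range(1, 10):
    # e.g. 5 categorical columns -> (5 + 2) // 3 = 2 rows of 3 subplots
    assert (n + num_cols - 1) // num_cols == math.ceil(n / num_cols)
print("ceiling-division trick verified")
```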
REVIEW (Categorical)
These pie charts just help to see the results previously analyzed
'Location' = Relatively even spread out. Every location has a significant portion of car sales.
'Fuel_Type' = Diesel & Petrol are roughly 50/50, with very little data left for Electric & CNG (Compressed Natural Gas), which is why their labels are jammed together on the chart.
'Transmission' = roughly a 2.5:1 split between Manual & Automatic
'Owner_Type' = over a 5:1 ratio for First & Second, then a slim portion left for Third and Fourth & Above
Honda, Hyundai and Maruti occupy nearly half of the data, and there are many companies that have so few sales that they're packed together because of their thin slices of the pie.
# 'New_price' Statistics grouped by Manufacturer
data.groupby('Manufacturer')['New_price'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Manufacturer | ||||||||
| Ambassador | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Audi | 29.0 | 65.141724 | 24.946476 | 42.89 | 48.6400 | 55.610 | 67.6700 | 128.00 |
| BMW | 63.0 | 62.473333 | 22.753032 | 43.31 | 48.7900 | 58.670 | 64.2350 | 166.00 |
| Bentley | 1.0 | 375.000000 | NaN | 375.00 | 375.0000 | 375.000 | 375.0000 | 375.00 |
| Chevrolet | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Datsun | 11.0 | 4.848182 | 1.052443 | 3.98 | 4.0450 | 4.370 | 5.6350 | 6.70 |
| Fiat | 4.0 | 9.002500 | 0.447242 | 8.60 | 8.6375 | 8.955 | 9.3200 | 9.50 |
| Force | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Ford | 29.0 | 14.074138 | 14.311504 | 7.56 | 9.1200 | 11.200 | 12.1200 | 83.92 |
| Hindustan | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Honda | 83.0 | 11.669759 | 4.055201 | 5.89 | 8.7500 | 11.150 | 14.6600 | 33.51 |
| Hyundai | 117.0 | 11.836154 | 5.269123 | 4.55 | 7.5400 | 9.740 | 15.5700 | 29.90 |
| Isuzu | 1.0 | 33.680000 | NaN | 33.68 | 33.6800 | 33.680 | 33.6800 | 33.68 |
| Jaguar | 7.0 | 58.455714 | 7.978318 | 52.77 | 53.1850 | 53.720 | 61.8450 | 72.64 |
| Jeep | 19.0 | 22.746316 | 2.805768 | 18.16 | 20.6950 | 22.950 | 24.1400 | 27.56 |
| Lamborghini | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Land Rover | 10.0 | 119.322000 | 51.697215 | 66.39 | 74.6475 | 120.000 | 139.0000 | 230.00 |
| Mahindra | 36.0 | 11.030278 | 2.520608 | 6.55 | 10.0550 | 11.145 | 11.6700 | 21.33 |
| Maruti | 237.0 | 7.845907 | 2.746784 | 3.91 | 5.2900 | 7.870 | 10.5700 | 13.33 |
| Mercedes-Benz | 49.0 | 73.642449 | 26.813965 | 34.63 | 49.1400 | 78.760 | 95.0400 | 171.00 |
| Mini | 19.0 | 40.894211 | 3.784658 | 34.89 | 37.8850 | 42.120 | 44.2800 | 44.59 |
| Mitsubishi | 13.0 | 33.314615 | 0.970787 | 31.93 | 32.4100 | 33.210 | 34.1000 | 34.60 |
| Nissan | 19.0 | 11.612632 | 3.760443 | 6.78 | 9.1000 | 9.510 | 15.0600 | 17.21 |
| OpelCorsa | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Porsche | 1.0 | 136.000000 | NaN | 136.00 | 136.0000 | 136.000 | 136.0000 | 136.00 |
| Renault | 33.0 | 7.676061 | 4.625152 | 3.95 | 4.7800 | 4.930 | 13.0800 | 15.94 |
| Skoda | 28.0 | 26.213571 | 10.642805 | 11.89 | 14.9275 | 31.365 | 33.4150 | 40.73 |
| Smart | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Tata | 52.0 | 8.775769 | 4.095953 | 5.45 | 6.5900 | 7.630 | 8.3950 | 22.74 |
| Toyota | 94.0 | 23.627021 | 8.438001 | 6.41 | 19.6875 | 23.250 | 26.9875 | 40.62 |
| Volkswagen | 48.0 | 11.910000 | 4.349351 | 6.74 | 9.9675 | 11.105 | 12.7800 | 36.45 |
| Volvo | 3.0 | 43.080000 | 4.486012 | 37.90 | 41.7850 | 45.670 | 45.6700 | 45.67 |
Observations ('New_price')
Ambassador, Chevrolet, Force, Hindustan, Lamborghini, OpelCorsa & Smart are all missing 'New_price' data
Porsche, Isuzu & Bentley only have a single value
# Target 'Price' Statistics grouped by Manufacturer
data.groupby('Manufacturer')['Price'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Manufacturer | ||||||||
| Ambassador | 1.0 | 1.350000 | NaN | 1.35 | 1.3500 | 1.350 | 1.3500 | 1.35 |
| Audi | 236.0 | 25.537712 | 12.344783 | 7.50 | 17.5000 | 23.500 | 31.9475 | 72.94 |
| BMW | 267.0 | 25.243146 | 15.164924 | 6.30 | 13.9250 | 21.000 | 32.1600 | 93.67 |
| Bentley | 1.0 | 59.000000 | NaN | 59.00 | 59.0000 | 59.000 | 59.0000 | 59.00 |
| Chevrolet | 121.0 | 3.044463 | 1.827563 | 0.75 | 1.7500 | 2.500 | 3.7000 | 9.87 |
| Datsun | 13.0 | 3.049231 | 0.476156 | 2.25 | 2.7000 | 3.100 | 3.3100 | 3.95 |
| Fiat | 28.0 | 3.269286 | 1.901356 | 0.55 | 2.0875 | 2.600 | 4.7000 | 7.71 |
| Force | 3.0 | 9.333333 | 2.309401 | 8.00 | 8.0000 | 8.000 | 10.0000 | 12.00 |
| Ford | 300.0 | 6.889400 | 7.508742 | 0.69 | 2.7500 | 5.345 | 7.3425 | 56.80 |
| Hindustan | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Honda | 608.0 | 5.411743 | 2.541032 | 0.90 | 3.5400 | 4.950 | 6.5850 | 17.50 |
| Hyundai | 1107.0 | 5.343433 | 3.311355 | 0.45 | 3.0950 | 4.600 | 6.4500 | 23.00 |
| Isuzu | 3.0 | 14.696667 | 6.120133 | 8.00 | 12.0450 | 16.090 | 18.0450 | 20.00 |
| Jaguar | 40.0 | 37.632250 | 18.154640 | 16.00 | 26.5150 | 31.900 | 41.0000 | 100.00 |
| Jeep | 15.0 | 18.718667 | 2.159183 | 15.00 | 17.4750 | 18.500 | 19.7300 | 23.91 |
| Lamborghini | 1.0 | 120.000000 | NaN | 120.00 | 120.0000 | 120.000 | 120.0000 | 120.00 |
| Land Rover | 60.0 | 39.259500 | 24.145777 | 11.00 | 21.5275 | 35.000 | 47.3650 | 160.00 |
| Mahindra | 272.0 | 8.045919 | 3.598662 | 1.50 | 5.5000 | 7.570 | 10.0825 | 17.63 |
| Maruti | 1211.0 | 4.517267 | 2.295684 | 0.45 | 2.8000 | 4.150 | 5.9950 | 11.50 |
| Mercedes-Benz | 318.0 | 26.809874 | 15.892033 | 3.90 | 14.5000 | 24.000 | 35.0000 | 90.00 |
| Mini | 26.0 | 26.896923 | 6.168550 | 17.00 | 22.0000 | 24.285 | 31.8875 | 39.75 |
| Mitsubishi | 27.0 | 11.058889 | 7.138884 | 1.00 | 5.8650 | 9.950 | 13.7250 | 28.00 |
| Nissan | 91.0 | 4.738352 | 1.759250 | 1.75 | 3.2750 | 4.300 | 6.1000 | 8.92 |
| OpelCorsa | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Porsche | 18.0 | 48.348333 | 19.632666 | 2.02 | 40.7275 | 47.020 | 63.2375 | 75.00 |
| Renault | 145.0 | 5.799034 | 2.384815 | 2.40 | 3.8000 | 5.490 | 7.1000 | 14.01 |
| Skoda | 173.0 | 7.559075 | 5.036701 | 1.25 | 4.2500 | 6.000 | 9.7200 | 27.30 |
| Smart | 1.0 | 3.000000 | NaN | 3.00 | 3.0000 | 3.000 | 3.0000 | 3.00 |
| Tata | 186.0 | 3.562849 | 2.693689 | 0.44 | 1.6950 | 2.900 | 4.5800 | 17.85 |
| Toyota | 411.0 | 11.580024 | 7.001564 | 1.50 | 5.5500 | 10.750 | 16.5000 | 35.82 |
| Volkswagen | 315.0 | 5.307270 | 2.460562 | 2.10 | 3.7500 | 4.890 | 6.1550 | 24.90 |
| Volvo | 21.0 | 18.802857 | 5.854036 | 7.25 | 16.0000 | 18.250 | 23.5800 | 32.50 |
Observations ('Price')
OpelCorsa & Hindustan lack 'Price' data (in addition to missing 'New_price') --> they'll be dropped since they have no price data at all and cannot contribute to the model.
Smart, Lamborghini, Bentley & Ambassador only have one value for 'Price'.
# Show manufacturers that are MISSING both prices (since we literally cannot analyze their data)
price_counts = data.groupby('Manufacturer')[['New_price', 'Price']].count()  # count non-null prices once
manufacturers_missing_both_prices = price_counts[
    (price_counts['New_price'] == 0) &
    (price_counts['Price'] == 0)
].index.tolist()
print(manufacturers_missing_both_prices)
['Hindustan', 'OpelCorsa']
# Drop the manufacturers that are missing both prices
data.drop(data[data['Manufacturer'].isin(manufacturers_missing_both_prices)].index, inplace=True)
data
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 | Maruti |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 | Hyundai |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 | Maruti |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 | Audi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7249 | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7250 | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN | Nissan |
| 7251 | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7252 | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN | Mercedes-Benz |
7251 rows × 13 columns
# Show manufacturers with 0 or 1 new price
data.groupby('Manufacturer')['New_price'].count()[data.groupby('Manufacturer')['New_price'].count() <= 1]
Manufacturer
Ambassador     0
Bentley        1
Chevrolet      0
Force          0
Isuzu          1
Lamborghini    0
Porsche        1
Smart          0
Name: New_price, dtype: int64
You can see that neither Hindustan nor OpelCorsa remains in the data (the DataFrame is down two rows & they no longer appear in the Manufacturer counts).
This would have been taken care of by the following code blocks removing rows with insufficient data, but I wanted to start with these two manufacturers for simplicity, since they have no price data whatsoever and contribute nothing to the model.
# Identify manufacturers with 0 or 1 (target) price
data.groupby('Manufacturer')['Price'].count()[data.groupby('Manufacturer')['Price'].count() <= 1]
Manufacturer
Ambassador     1
Bentley        1
Lamborghini    1
Smart          1
Name: Price, dtype: int64
price_counts = data.groupby('Manufacturer')[['New_price', 'Price']].count()  # count non-null prices once
manufacturers_insuff_data = price_counts[
    (price_counts['New_price'] <= 1) &  # at most one 'New_price' entry
    (price_counts['Price'] <= 1)        # at most one 'Price' entry
].index.tolist()
print(manufacturers_insuff_data)
['Ambassador', 'Bentley', 'Lamborghini', 'Smart']
# Drop those brands
data.drop(data[data['Manufacturer'].isin(manufacturers_insuff_data)].index, inplace=True)
data
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 | Maruti |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 | Hyundai |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 | Maruti |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 | Audi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7249 | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7250 | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN | Nissan |
| 7251 | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7252 | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN | Mercedes-Benz |
7246 rows × 13 columns
After Removing Manufacturers with Insufficient Data
We've gone from 7,253 original rows --> 7,246 rows now (dropped seven rows)
There are still a few manufacturers missing 'New_price' data (Chevrolet, Force, Isuzu & Porsche), but they all still have at least two entries in 'Price' (as do all remaining manufacturers).
Mileage values that low were likely never recorded and simply entered as 0's.
Again, this separate call is redundant since mileage this low would be caught by the upcoming outlier removal, but I wanted to handle it explicitly in case such rows linger if/when I adjust the outlier-removal threshold (instead of the standard 1.5×IQR).
# Drop rows with mileage less than five
data = data.drop(data.loc[data['Mileage'] < 5].index)
data
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 | Maruti |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 | Hyundai |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 | Maruti |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 | Audi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7249 | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7250 | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN | Nissan |
| 7251 | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7252 | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN | Mercedes-Benz |
7166 rows × 13 columns
# Copy for Removing Outliers (since we may adjust)
data_out = data.copy()
numeric_columns = data_out.select_dtypes(include=['int64', 'float64']).columns
# Count outliers for each numeric column (BEFORE removal 1st round)
for col in numeric_columns:
    # Calculate IQR and boundaries
    Q1 = data_out[col].quantile(0.25)
    Q3 = data_out[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    # Count outliers
    num_outliers = data_out[(data_out[col] < lower_bound) | (data_out[col] > upper_bound)].shape[0]
    # Print results
    print(f"Column: {bold_text(col)}")
    print(f"Number of outliers: {num_outliers} \n")
Column: Year
Number of outliers: 139

Column: Kilometers_Driven
Number of outliers: 253

Column: Mileage
Number of outliers: 18

Column: Engine
Number of outliers: 60

Column: Power
Number of outliers: 267

Column: Seats
Number of outliers: 1145

Column: New_price
Number of outliers: 114

Column: Price
Number of outliers: 702
# Boxplot for year and seats
plt.figure(figsize=(15, 2))
plt.subplot(1,2,1)
sns.boxplot(data=data_out, x='Year')
plt.subplot(1,2,2)
sns.boxplot(data=data_out, x='Seats')
plt.show()
# REMOVE OUTLIERS for each numeric column
for col in numeric_columns:
    if col not in ['Year', 'Seats']: # skip checking these columns for outliers
        # Calculate IQR and boundaries
        Q1 = data_out[col].quantile(0.25)
        Q3 = data_out[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data_out = data_out[(data_out[col] >= lower_bound) & (data_out[col] <= upper_bound)]
data_out
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 7 | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 21.00 | 17.50 | Toyota |
| 10 | Kochi | 2018 | 25692 | Petrol | Manual | First | 21.56 | 1462.0 | 103.25 | 5.0 | 10.65 | 9.95 | Maruti |
| 15 | Delhi | 2014 | 110000 | Diesel | Manual | First | 13.50 | 2477.0 | 175.56 | 7.0 | 32.01 | 15.00 | Mitsubishi |
| 28 | Kochi | 2018 | 37430 | Petrol | Manual | First | 17.50 | 1199.0 | 88.70 | 5.0 | 10.57 | 9.90 | Honda |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5999 | Chennai | 2016 | 10000 | Petrol | Manual | First | 17.57 | 1193.0 | 88.70 | 5.0 | 7.77 | 4.00 | Tata |
| 6002 | Mumbai | 2011 | 38000 | Petrol | Manual | First | 16.09 | 1598.0 | 103.50 | 5.0 | 11.91 | 3.25 | Volkswagen |
| 6005 | Pune | 2016 | 37208 | Diesel | Manual | First | 24.30 | 1248.0 | 88.50 | 5.0 | 9.93 | 7.43 | Maruti |
| 6010 | Delhi | 2013 | 33746 | Petrol | Manual | First | 18.50 | 1198.0 | 86.80 | 5.0 | 6.63 | 3.20 | Honda |
| 6014 | Delhi | 2014 | 27365 | Diesel | Manual | First | 28.40 | 1248.0 | 74.00 | 5.0 | 7.88 | 4.75 | Maruti |
622 rows × 13 columns
Observations (pt. 1 - First Glance)
Right away, we can see that removing outliers with the 1.5xIQR threshold had significant effects:
# Count outliers for each numeric column (AFTER removal)
for col in numeric_columns:
    # Calculate IQR and boundaries
    Q1 = data_out[col].quantile(0.25)
    Q3 = data_out[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    # Count outliers
    num_outliers = data_out[(data_out[col] < lower_bound) | (data_out[col] > upper_bound)].shape[0]
    # Print results
    print(f"Column: {bold_text(col)}")
    print(f"Number of outliers: {num_outliers} \n")
Column: Year
Number of outliers: 35

Column: Kilometers_Driven
Number of outliers: 17

Column: Mileage
Number of outliers: 0

Column: Engine
Number of outliers: 85

Column: Power
Number of outliers: 4

Column: Seats
Number of outliers: 99

Column: New_price
Number of outliers: 37

Column: Price
Number of outliers: 48
Observations (pt. 2 - Outlier Count)
Although 'Year' & 'Seats' were not directly checked for outliers, they were still affected by the outlier removal: the rows dropped for outliers in other columns also carried 'Year' and 'Seats' values, so those distributions (and hence their quartiles and outlier counts) shift after the removal.
Even after one round of outlier removal, a recount still yields a significant number of outliers. And while the counts are much smaller (ex: 'Year' outliers went from 139 -> 35), they actually account for significantly larger proportions (139 outliers / 7,166 records is about 2%, while 35 outliers / 622 records is closer to 6%).
So again, the default outlier removal method is not optimal: outliers still remain, and they now constitute a much larger proportion of the data.
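Since we expect to experiment with the multiplier, the filtering loop can be wrapped in a small helper with an adjustable `k` (a sketch; the function name is mine, not from the notebook):

```python
import pandas as pd

def remove_iqr_outliers(df, cols, k=1.5):
    """Sequentially drop rows outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out = out[(out[col] >= q1 - k * iqr) & (out[col] <= q3 + k * iqr)]
    return out

# Toy demo: the extreme value is cut at k=1.5 but survives a looser threshold
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 100]})
print(len(remove_iqr_outliers(toy, ['x'], k=1.5)))   # 5
print(len(remove_iqr_outliers(toy, ['x'], k=50)))    # 6
```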
data_out.describe()
| Year | Kilometers_Driven | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|
| count | 622.000000 | 622.000000 | 622.000000 | 622.000000 | 622.000000 | 622.000000 | 622.000000 | 622.000000 |
| mean | 2015.937299 | 39313.381029 | 20.160016 | 1435.255627 | 98.462251 | 5.321543 | 11.876592 | 7.773907 |
| std | 2.239506 | 25794.716900 | 4.123303 | 439.167494 | 31.617885 | 0.750407 | 6.890394 | 4.378539 |
| min | 2005.000000 | 600.000000 | 11.250000 | 799.000000 | 53.000000 | 5.000000 | 3.910000 | 1.750000 |
| 25% | 2015.000000 | 20006.000000 | 17.195000 | 1197.000000 | 74.000000 | 5.000000 | 7.375000 | 4.500000 |
| 50% | 2017.000000 | 35120.500000 | 20.140000 | 1248.000000 | 88.500000 | 5.000000 | 10.255000 | 6.525000 |
| 75% | 2017.000000 | 53240.000000 | 23.100000 | 1498.000000 | 117.300000 | 5.000000 | 14.177500 | 9.487500 |
| max | 2019.000000 | 130000.000000 | 28.400000 | 2999.000000 | 227.000000 | 8.000000 | 43.810000 | 20.140000 |
# Remake the copy to adjust the threshold
data_out = data.copy()
numeric_cols = data_out.select_dtypes(include=['int64', 'float64']).columns
data_out
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 | Maruti |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 | Hyundai |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 | Maruti |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 | Audi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7249 | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7250 | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN | Nissan |
| 7251 | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7252 | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN | Mercedes-Benz |
7166 rows × 13 columns
# REMOVE OUTLIERS for each numeric column
for col in numeric_columns:
    if col not in ['Year', 'Seats']: # skip checking these columns for outliers
        # Calculate IQR and boundaries
        Q1 = data_out[col].quantile(0.25)
        Q3 = data_out[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (3 * IQR) # doubling the 1.5x multiplier to 3x to retain more of the data
        upper_bound = Q3 + (3 * IQR)
        data_out = data_out[(data_out[col] >= lower_bound) & (data_out[col] <= upper_bound)]
data_out
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 7 | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 21.00 | 17.50 | Toyota |
| 10 | Kochi | 2018 | 25692 | Petrol | Manual | First | 21.56 | 1462.0 | 103.25 | 5.0 | 10.65 | 9.95 | Maruti |
| 15 | Delhi | 2014 | 110000 | Diesel | Manual | First | 13.50 | 2477.0 | 175.56 | 7.0 | 32.01 | 15.00 | Mitsubishi |
| 20 | Kochi | 2014 | 32982 | Diesel | Automatic | First | 22.69 | 1995.0 | 190.00 | 5.0 | 47.87 | 18.55 | BMW |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5999 | Chennai | 2016 | 10000 | Petrol | Manual | First | 17.57 | 1193.0 | 88.70 | 5.0 | 7.77 | 4.00 | Tata |
| 6002 | Mumbai | 2011 | 38000 | Petrol | Manual | First | 16.09 | 1598.0 | 103.50 | 5.0 | 11.91 | 3.25 | Volkswagen |
| 6005 | Pune | 2016 | 37208 | Diesel | Manual | First | 24.30 | 1248.0 | 88.50 | 5.0 | 9.93 | 7.43 | Maruti |
| 6010 | Delhi | 2013 | 33746 | Petrol | Manual | First | 18.50 | 1198.0 | 86.80 | 5.0 | 6.63 | 3.20 | Honda |
| 6014 | Delhi | 2014 | 27365 | Diesel | Manual | First | 28.40 | 1248.0 | 74.00 | 5.0 | 7.88 | 4.75 | Maruti |
765 rows × 13 columns
data_out.describe()
| Year | Kilometers_Driven | Mileage | Engine | Power | Seats | New_price | Price | |
|---|---|---|---|---|---|---|---|---|
| count | 765.000000 | 765.000000 | 765.000000 | 765.000000 | 765.000000 | 765.000000 | 765.000000 | 765.000000 |
| mean | 2015.797386 | 41132.094118 | 20.016758 | 1539.243137 | 110.990170 | 5.312418 | 17.761242 | 10.580157 |
| std | 2.391519 | 28408.824630 | 4.300636 | 483.976108 | 41.933828 | 0.772847 | 15.490169 | 8.149830 |
| min | 2001.000000 | 600.000000 | 11.250000 | 799.000000 | 53.000000 | 4.000000 | 3.910000 | 0.700000 |
| 25% | 2015.000000 | 20712.000000 | 17.000000 | 1197.000000 | 81.800000 | 5.000000 | 7.870000 | 4.900000 |
| 50% | 2016.000000 | 35569.000000 | 20.140000 | 1461.000000 | 93.700000 | 5.000000 | 11.260000 | 7.690000 |
| 75% | 2017.000000 | 54387.000000 | 22.950000 | 1956.000000 | 143.000000 | 5.000000 | 21.690000 | 13.200000 |
| max | 2019.000000 | 170000.000000 | 33.540000 | 2999.000000 | 246.740000 | 8.000000 | 71.820000 | 40.000000 |
This looks significantly better than before, so we'll update the "data" set using this second version of outlier removal, since it retains about 10% of the original data.
There is likely still some skew in the features, which will be dealt with by upcoming methods.
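That remaining skew can also be checked numerically with pandas' built-in `.skew()`; here is a sketch on a made-up toy frame (one roughly symmetric column, one with a long right tail):

```python
import pandas as pd

# Toy frame: 'Mileage' is symmetric; 'Price' has one large value pulling the tail right
toy = pd.DataFrame({
    'Mileage': [18, 19, 20, 21, 22],
    'Price':   [1, 2, 2, 3, 20],
})

skews = toy.skew(numeric_only=True)
print(skews)
# |skew| near 0 -> roughly symmetric; large positive -> right (positive) skew
```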
# Looking at "data" boxplot before settling on outlier removal system
import matplotlib.pyplot as plt
plt.figure(figsize=(5, 1))
sns.boxplot(data=data, x='Mileage')
plt.show()
# Boxplot for "data_out" after applying 3*IQR threshold
import matplotlib.pyplot as plt
plt.figure(figsize=(5, 1))
sns.boxplot(data=data_out, x='Mileage')
plt.show()
Also looks decent (the remaining outliers are still pretty close to the data)
We'll copy data_out --> data with this 3xIQR threshold.
# Make another copy of "data" before duplicating "data_out" so that we can compare some analyses later on
data_orig = data.copy()
data_orig
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.60 | 998.0 | 58.16 | 5.0 | NaN | 1.75 | Maruti |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 | 1582.0 | 126.20 | 5.0 | NaN | 12.50 | Hyundai |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 | 1248.0 | 88.76 | 7.0 | NaN | 6.00 | Maruti |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.20 | 1968.0 | 140.80 | 5.0 | NaN | 17.74 | Audi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 | 1598.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7249 | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7250 | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 | 1461.0 | 63.10 | 5.0 | NaN | NaN | Nissan |
| 7251 | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.20 | 1197.0 | 103.60 | 5.0 | NaN | NaN | Volkswagen |
| 7252 | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.00 | 2148.0 | 170.00 | 5.0 | NaN | NaN | Mercedes-Benz |
7166 rows × 13 columns
# Now update the "data" set with the 3xIQR outlier removal threshold
data = data_out.copy()  # .copy() so later edits to "data" don't also mutate "data_out"
data
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 7 | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 21.00 | 17.50 | Toyota |
| 10 | Kochi | 2018 | 25692 | Petrol | Manual | First | 21.56 | 1462.0 | 103.25 | 5.0 | 10.65 | 9.95 | Maruti |
| 15 | Delhi | 2014 | 110000 | Diesel | Manual | First | 13.50 | 2477.0 | 175.56 | 7.0 | 32.01 | 15.00 | Mitsubishi |
| 20 | Kochi | 2014 | 32982 | Diesel | Automatic | First | 22.69 | 1995.0 | 190.00 | 5.0 | 47.87 | 18.55 | BMW |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5999 | Chennai | 2016 | 10000 | Petrol | Manual | First | 17.57 | 1193.0 | 88.70 | 5.0 | 7.77 | 4.00 | Tata |
| 6002 | Mumbai | 2011 | 38000 | Petrol | Manual | First | 16.09 | 1598.0 | 103.50 | 5.0 | 11.91 | 3.25 | Volkswagen |
| 6005 | Pune | 2016 | 37208 | Diesel | Manual | First | 24.30 | 1248.0 | 88.50 | 5.0 | 9.93 | 7.43 | Maruti |
| 6010 | Delhi | 2013 | 33746 | Petrol | Manual | First | 18.50 | 1198.0 | 86.80 | 5.0 | 6.63 | 3.20 | Honda |
| 6014 | Delhi | 2014 | 27365 | Diesel | Manual | First | 28.40 | 1248.0 | 74.00 | 5.0 | 7.88 | 4.75 | Maruti |
765 rows × 13 columns
Questions:
# Defining a variable for numeric columns
numeric_columns = data_orig.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(numeric_columns) # print the list
['Year', 'Kilometers_Driven', 'Mileage', 'Engine', 'Power', 'Seats', 'New_price', 'Price']
# Create the axes object with the correct number of rows and columns:
# Eight variables (without 'S.No.') / three columns -> three rows
num_cols = 3
# 'Adjustment' variable implements ceiling division so no plots get cut off
adjustment = num_cols - 1
# Adding (num_cols - 1) before the integer division "// num_cols" rounds the
# -> quotient up instead of truncating it, guaranteeing enough rows for all the plots.
num_rows = (len(numeric_columns) + adjustment) // num_cols
axes = plt.subplots(num_rows, num_cols, figsize=(20, 10))[1] # the [1] extracts the second element (axes object) from the tuple
# Loop through the variables and plot each.
for i, col in enumerate(numeric_columns):
    ax = axes[i // num_cols, i % num_cols]
    # Setting 'x=col' explicitly keeps the function from having to infer
    # -> what belongs on the x-axis.
    sns.histplot(data=data_orig, x=col, kde=True, ax=ax)
    ax.set_title(col) # the x-axis already names the variable, but a larger title reads more clearly
# Hide the empty axes left over from the ceiling-division adjustment.
for i in range(len(numeric_columns), num_rows * num_cols):
    axes[i // num_cols, i % num_cols].set_visible(False)
# Display the plot
plt.tight_layout() # automatically adjusts the spacing since we have multiple plots in a confined space
plt.show()
Numeric Values Distributions Observations
Left Skew
Normal
Right Skew
Multimodal
# Histplots after removing outliers:
num_cols = 3
adjustment = num_cols - 1
num_rows = (len(numeric_columns) + adjustment) // num_cols
axes = plt.subplots(num_rows, num_cols, figsize=(20, 10))[1]
# Loop through the variables and plot each.
for i, col in enumerate(numeric_columns):
    ax = axes[i // num_cols, i % num_cols]
    sns.histplot(data=data, x=col, kde=True, ax=ax)
    ax.set_title(col)
# Hide the empty axes left over from the ceiling-division adjustment.
for i in range(len(numeric_columns), num_rows * num_cols):
    axes[i // num_cols, i % num_cols].set_visible(False)
# Display the plot
plt.tight_layout()
plt.show()
Post-Outlier Removal Distributions Observations:
'Year' actually looks more skewed now + the axis has decimals. Neither of these things should really be an issue since it's still a pretty small range.
'Seats' shrunk its range a bit since nearly all of the data is either 5 or 7 seats.
The other features, while not quite normally distributed now, are certainly better than they were before removing outliers.
# Copy for Logged Experiment
data_logged = data.copy()
# Performing Log Transformation EXCEPT 'Year' & 'Seats' since they're discrete & logging would hurt interpretability
for col in numeric_columns:
    if col not in ('Year', 'Seats'):
        data_logged[col] = np.log1p(data_logged[col])
# Create the axes object with the correct number of rows and columns:
num_cols = 3
adjustment = num_cols - 1
num_rows = (len(numeric_columns) + adjustment) // num_cols
axes = plt.subplots(num_rows, num_cols, figsize=(20, 10))[1]
# Loop through the variables and plot each.
for i, col in enumerate(numeric_columns):
    ax = axes[i // num_cols, i % num_cols]
    sns.histplot(data=data_logged, x=col, kde=True, ax=ax)
    ax.set_title(col)
# Hide the empty axes left over from the ceiling-division adjustment.
for i in range(len(numeric_columns), num_rows * num_cols):
    axes[i // num_cols, i % num_cols].set_visible(False)
# Display the plot
plt.tight_layout()
plt.show()
Post-Log Observations
No change for 'Year' & 'Seats', obviously, since we excluded them from logging.
Surprisingly, 'Kilometers_Driven' flipped from right-skewed to left-skewed (although with a less dramatic tail). So a log transformation did not help to normalize that distribution. Further consideration for this variable will be needed moving forward when building models.
'Mileage' & 'Engine' look similar to before (but with adjusted x-axes). So we'll actually exclude them from logging since it didn't really help & we want to try to preserve original data as much as possible.
'Power' was kind of leveled out, but the skew before was not extreme, so we'll likely exclude this from logging, as well, to preserve the original values. A look at its pre-logged boxplot will also help determine if a log transformation is really called for.
'New_price' & 'Price' are far closer to normal distribution (especially 'Price'), which is the goal of log transformation.
# prompt: boxplot power
plt.figure(figsize=(5, 1))
sns.boxplot(data=data, x='Power')
plt.show()
# prompt: boxplot power
plt.figure(figsize=(5, 1))
sns.boxplot(data=data_logged, x='Power')
plt.show()
Pre/Post Log Impact on 'Power':
So while the log transformation did remove a single outlier, that outlier was not far off the whisker.
And while the whiskers are more equal in length, the box shape has not changed much & the median is still farther to the left, so the mean would still be higher --> still some right-skew. Thus, it's not worth performing a log transformation on it.
CONCLUSION after Log Experimentation:
Really, only the two price variables benefitted from performing a log transformation.
However, as we previously saw, 'New_price' has so many empty values (~86%), it will likely be dropped anyway.
So as a result, only Price will need a log transformation --> just add a column instead of logging the whole DataFrame.
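That plan (a single logged column rather than logging the whole frame) might look like the following sketch; `np.expm1` is the exact inverse of `np.log1p`, which matters later when converting predictions back to the original price scale:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; 'Price' values mirror the dataset's units (INR 100,000)
df = pd.DataFrame({'Price': [1.75, 4.50, 17.74]})

# Add a logged copy of the target; the original 'Price' column stays intact
df['Price_log'] = np.log1p(df['Price'])

# Predictions made on the log scale can be mapped back with expm1
back = np.expm1(df['Price_log'])
print(np.allclose(back, df['Price']))  # True
```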
Questions:
# Scatter plot for numeric values
sns.pairplot(data, vars = numeric_columns, diag_kind = 'kde')
plt.show()
Pairplot Observations (Visual)
'Year' is mostly above 2008, and its only visibly discernible correlation is with 'Price': a positive slope along the lower-right border. It makes sense that a more recent car would cost more.
'Kilometers_Driven' - Most records are under 100,000
'Mileage' - Mostly under 30
'Engine' has a bit of a positive correlation with 'Power' & somewhat so with the two prices ('New_price' & 'Price').
'Power' is mostly under 200 & has a strong positive correlation with 'New_price' and a bit of one with 'Price'.
'Seats' has flat lines (since it's discrete) and mostly at 5 & 7 (which we already knew).
'New_price' has a generally strong positive correlation with 'Price'.
Interpretations
The direct proportionality between 'Engine' & 'Power' makes sense since a larger engine should be able to have more horsepower.
'New_price' logically rises as the used 'Price' rises.
Contextually, there's no obvious reason why a car that has more mileage should have a smaller engine or power; however, it does make sense that the correlation is similar between the mileage and those two variables since the latter two have a strong collinearity with each other.
# Heatmap for numeric columns
corr = data.corr(numeric_only=True) # newer pandas versions no longer silently drop non-numeric columns, so numeric_only must be set explicitly
plt.figure(figsize=(10, 5))
plt.title('Correlation Heatmap for Numeric Columns')
### Show just LOWER Triangle since the other corner has duplicate values ###
# 1) 'corr.where' = makes a separate matrix that gives you more control (i.e. in this case, showing just one triangle)
# 2) 'np.tril' = extracts just the lower triangle (as opposed to 'np.triu' for the upper)
# 3) 'np.ones' = creates a matrix of ones (matching the shape of the 'corr' matrix).
# Helps w/ readability to emphasize that this is a boolean mask (to just show lower triangle)
# It could also help with efficiency since the function is highly optimized
# 4) 'k=-1' = removes the main diagonal where each variable matches with itself (1.0)
# 5) '.astype(bool)' = used in conjunction with the 'tril' boolean mask function to extract the lower triangle
lower = corr.where(np.tril(np.ones(corr.shape), k=-1).astype(bool))
# Create the actual heatmap with the lower triangle
# 1) the 'coolwarm' color map gives an intuitive aesthetic since this is a "heatmap" afterall
# 2) 'fmt' = sets the rounding (i.e. ".3f" -> three decimal places)
sns.heatmap(lower, annot=True, cmap='coolwarm', fmt=".3f")
plt.show()
Heatmap Observations
The strongest positive correlations (>0.80) are between:
On the negative side, there were no significant correlations (-0.627 was the largest in magnitude).
Results --> Plan
'New_price' has so many missing values (~86%) AND shows high multicollinearity with both of the other predictors that correlate strongly with our target 'Price' --> it will be dropped when building models.
We also have a probable multicollinearity between 'Power' & 'Engine' (i.e. a larger engine has more power: logical) with a factor of 0.859. But since the 'Engine' specifically has a lower correlation with our target 'Price' than 'Power' (0.747 vs. 0.830, respectively), we'll likely drop 'Engine', as well.
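The "keep the one more correlated with the target" rule used above can be written down explicitly. This is a sketch (the helper name is mine), using the correlation values reported in the heatmap:

```python
import pandas as pd

def weaker_of_pair(corr, col_a, col_b, target):
    """Of two collinear columns, return the one LESS correlated with the target
    (i.e. the candidate to drop)."""
    if abs(corr.loc[col_a, target]) < abs(corr.loc[col_b, target]):
        return col_a
    return col_b

# Correlations as reported in the heatmap observations above
corr = pd.DataFrame(
    {'Engine': [1.000, 0.859, 0.747],
     'Power':  [0.859, 1.000, 0.830],
     'Price':  [0.747, 0.830, 1.000]},
    index=['Engine', 'Power', 'Price'],
)
print(weaker_of_pair(corr, 'Engine', 'Power', 'Price'))  # 'Engine'
```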
data
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_price | Price | Manufacturer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 8.61 | 4.50 | Honda |
| 7 | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 21.00 | 17.50 | Toyota |
| 10 | Kochi | 2018 | 25692 | Petrol | Manual | First | 21.56 | 1462.0 | 103.25 | 5.0 | 10.65 | 9.95 | Maruti |
| 15 | Delhi | 2014 | 110000 | Diesel | Manual | First | 13.50 | 2477.0 | 175.56 | 7.0 | 32.01 | 15.00 | Mitsubishi |
| 20 | Kochi | 2014 | 32982 | Diesel | Automatic | First | 22.69 | 1995.0 | 190.00 | 5.0 | 47.87 | 18.55 | BMW |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5999 | Chennai | 2016 | 10000 | Petrol | Manual | First | 17.57 | 1193.0 | 88.70 | 5.0 | 7.77 | 4.00 | Tata |
| 6002 | Mumbai | 2011 | 38000 | Petrol | Manual | First | 16.09 | 1598.0 | 103.50 | 5.0 | 11.91 | 3.25 | Volkswagen |
| 6005 | Pune | 2016 | 37208 | Diesel | Manual | First | 24.30 | 1248.0 | 88.50 | 5.0 | 9.93 | 7.43 | Maruti |
| 6010 | Delhi | 2013 | 33746 | Petrol | Manual | First | 18.50 | 1198.0 | 86.80 | 5.0 | 6.63 | 3.20 | Honda |
| 6014 | Delhi | 2014 | 27365 | Diesel | Manual | First | 28.40 | 1248.0 | 74.00 | 5.0 | 7.88 | 4.75 | Maruti |
765 rows × 13 columns
# Get the pairs with strong correlations
strong_pairs = corr.unstack().sort_values(ascending=False).drop_duplicates()
strong_pairs = strong_pairs[(abs(strong_pairs) >= 0.70) & (abs(strong_pairs) < 1.0)]
# Loop through the pairs and plot the bivariate analysis
for col1, col2 in strong_pairs.index:
    plt.figure(figsize=(4, 3))
    plt.title(f'{col1} & {col2} w/ correlation factor = {round(data[col1].corr(data[col2]), 3)}')
    sns.regplot(x=data[col1], y=data[col2]) # '.regplot' fits a line of best fit onto a scatterplot (hence, "regression")
    plt.show()
All "strong" correlations (absolute value at least 0.7) are positive
The lines have tight confidence bands with a few that slightly widen towards the end. This makes sense since they're strong correlations with the widening happening when the data is a little more spread out.
The slope of 'Power' vs. 'New_price' is steeper than one might assume because of that vertical arrangement of dots at Power just below 200. I suspect that without those, the line would be flatter but with a higher correlation factor.
We'll run box plots comparing 'New_price' & 'Price' with 'Location' & the newly introduced 'Manufacturer'.
And if a boxplot shows significant skews, we'll run a violin plot since that shows densities (which help with skewed data).
# Box plot for LOCATION vs. NEW PRICE
sns.boxplot(data=data, hue='Location', x='New_price', y='Location')
plt.xticks()
plt.title('Comparing NEW Prices in each Location')
plt.show()
# Violin plot for LOCATION vs. NEW PRICE
plt.figure(figsize=(6, 7))
sns.violinplot(data=data, hue='Location', x='New_price', y='Location')
plt.xticks()
plt.title('Comparing NEW Prices in each Location')
plt.show()
Location NEW Price Observations
The locations have pretty similar violins (although you can, again, see the differing boxes within).
Two locations do stand out a bit since they are a bit thicker in the upper end (while the rest are pretty narrow past 3/3.5): Bangalore & Ahmedabad. This indicates that they have more data with higher new prices.
# Box plot for MANUFACTURER (newly introduced) vs. NEW PRICE
plt.figure(figsize=(15, 5)) # must be BEFORE the plot call to set the size
sns.boxplot(data=data, hue='Manufacturer', x='Manufacturer', y='New_price')
plt.title('Comparing NEW Prices in each Manufacturer') # must be AFTER the plot call to avoid a blank plot first
plt.xticks(rotation=70) # rotation is counter-clockwise
plt.show()
# Box plot for MANUFACTURER (newly introduced) vs. NEW PRICE
plt.figure(figsize=(15, 5)) # must be BEFORE the plot call to set the size
sns.violinplot(data=data, hue='Manufacturer', x='Manufacturer', y='New_price')
plt.title('Comparing NEW Prices in each Manufacturer') # must be AFTER the plot call to avoid a blank plot first
plt.xticks(rotation=70) # rotation is counter-clockwise
plt.show()
Manufacturer NEW Price Observations
The more tightly packed Manufacturers (i.e. their prices are more consistent) are: Maruti, Mitsubishi, Datsun, Jaguar & Jeep.
On the other hand, there is significant variance with companies in two general ways:
# Box plot for LOCATION vs. PRICE (target)
sns.boxplot(data=data, hue='Location', x='Price', y='Location')
plt.xticks()
plt.title('Comparing TARGET Prices in each Location')
plt.show()
Location/Price Boxplot Observations
In general, the locations have somewhat similar distributions of price. There are certainly some tighter boxplots, but the fact that the ranges are relatively similar is likely due, in part, to the removal of outliers.
A couple groupings of distribution widths:
Other than Chennai, there does not seem to be any "extreme" skew here, so a violin plot isn't strictly necessary, but we'll run one anyway for comparison.
# Violin plot for LOCATION vs. PRICE (target)
plt.figure(figsize=(7, 6))
sns.violinplot(data=data, hue='Location', x='Price', y='Location')
plt.xticks()
plt.title('Comparing TARGET Prices in each Location')
plt.show()
# Box plot for MANUFACTURER (newly introduced) vs. PRICE (target)
plt.figure(figsize=(15, 5)) # must be BEFORE the plot call to set the size
sns.boxplot(data=data, hue='Manufacturer', x='Manufacturer', y='Price')
plt.title('Comparing TARGET Prices in each Manufacturer') # must be AFTER the plot call to avoid a blank plot first
plt.xticks(rotation=90) # rotation is counter-clockwise
plt.show()
# Box plot for MANUFACTURER (newly introduced) vs. PRICE (target)
plt.figure(figsize=(15, 5)) # must be BEFORE the plot call to set the size
sns.violinplot(data=data, hue='Manufacturer', x='Manufacturer', y='Price')
plt.title('Comparing TARGET Prices in each Manufacturer') # must be AFTER the plot call to avoid a blank plot first
plt.xticks(rotation=90) # rotation is counter-clockwise
plt.show()
Manufacturer Target Price Distribution Observations
The more tightly packed Manufacturers (i.e. their prices are more consistent) are: Mitsubishi, Nissan, Datsun, Jeep & BMW.
On the other hand, there is significant variance with companies like Skoda (especially), Toyota, Honda, Maruti & Hyundai. Part of this could be due to them having more records and, thus, a wider range of models (which would lead to a wider range of prices).
# Average Target Price vs. Manufacturer
plt.figure(figsize=(8, 6.5))
sns.barplot(y = 'Manufacturer', x = 'Price', data = data, hue='Manufacturer')
plt.title('Average Target Price vs. Manufacturer')
plt.show()
print('There are ', data['Manufacturer'].nunique(), ' brands remaining.')
There are 22 brands remaining.
Observations
Fiat has the lowest average target price (~ 2.5) & Jaguar has the highest (~37.5).
After removing rows w/ insufficient data & outliers, we have 22 brands remaining.
# Average Target Price vs. Location
plt.figure(figsize=(8, 5))
sns.barplot(y = 'Location', x = 'Price', data = data, hue='Location')
plt.title('Average Target Price vs. Location')
plt.show()
print('There are ', data['Location'].nunique(), ' locations remaining.')
There are 11 locations remaining.
Think about it: The Name column in the current format might not be very useful in our analysis.
Since the name contains both the brand name and the model name of the vehicle, the column would have too many unique values to be useful in prediction. Can we extract that information from that column?
Brand Name Extraction
We already did this earlier in the code; it seemed an obvious step near the beginning after seeing how many unique model names there were.
Plus, brand name (or as I called it above, "Manufacturer") is certainly a consideration when pricing & buying a car (which is the main objective of this project: identify features that affect a used car's price to help consumers make a decision).
The overall method of extracting the brand name was to use the split() function on the 'Name' field to create a list with each word as a separate item.
Then, with only one exception, every manufacturer was the first word in that list, so we simply saved that first word as the brand name in a separate column.
Land Rover was the one exception, but it was straightforward to add an "if" branch checking whether the first word was "Land" --> in that case the brand name ('Manufacturer') was saved as the first two words (i.e. "Land Rover").
We made sure to check the list of brand names for any other brands with two words; there were none, as the list below verifies.
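A minimal sketch of that extraction logic (the helper name `extract_manufacturer` is hypothetical; the notebook's earlier cell did the equivalent inline):

```python
import pandas as pd

def extract_manufacturer(name: str) -> str:
    """First word of the car name is the brand, except 'Land Rover',
    which needs the first two words."""
    words = name.split()
    if words[0] == 'Land':
        return ' '.join(words[:2])  # the one two-word brand in the data
    return words[0]

# Illustrative names (not actual rows from the dataset)
df = pd.DataFrame({'Name': ['Honda City 1.5 S MT', 'Land Rover Range Rover Sport']})
df['Manufacturer'] = df['Name'].apply(extract_manufacturer)
print(df['Manufacturer'].tolist())   # ['Honda', 'Land Rover']
```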
# List of brand names
data['Manufacturer'].unique()
array(['Honda', 'Toyota', 'Maruti', 'Mitsubishi', 'BMW', 'Volkswagen',
'Hyundai', 'Nissan', 'Tata', 'Datsun', 'Mahindra', 'Skoda', 'Mini',
'Jaguar', 'Audi', 'Jeep', 'Ford', 'Renault', 'Volvo', 'Fiat',
'Mercedes-Benz', 'Isuzu'], dtype=object)
# Set a variable for the column to drop (so this cell can be rerun safely)
column_to_drop = 'New_price'
# Check if the column exists in the DataFrame
if column_to_drop in data.columns:
    # Drop the column if it still exists
    data.drop(columns=[column_to_drop], inplace=True)
data
| | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | Price | Manufacturer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.20 | 1199.0 | 88.70 | 5.0 | 4.50 | Honda |
| 7 | Mumbai | 2016 | 36000 | Diesel | Automatic | First | 11.36 | 2755.0 | 171.50 | 8.0 | 17.50 | Toyota |
| 10 | Kochi | 2018 | 25692 | Petrol | Manual | First | 21.56 | 1462.0 | 103.25 | 5.0 | 9.95 | Maruti |
| 15 | Delhi | 2014 | 110000 | Diesel | Manual | First | 13.50 | 2477.0 | 175.56 | 7.0 | 15.00 | Mitsubishi |
| 20 | Kochi | 2014 | 32982 | Diesel | Automatic | First | 22.69 | 1995.0 | 190.00 | 5.0 | 18.55 | BMW |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5999 | Chennai | 2016 | 10000 | Petrol | Manual | First | 17.57 | 1193.0 | 88.70 | 5.0 | 4.00 | Tata |
| 6002 | Mumbai | 2011 | 38000 | Petrol | Manual | First | 16.09 | 1598.0 | 103.50 | 5.0 | 3.25 | Volkswagen |
| 6005 | Pune | 2016 | 37208 | Diesel | Manual | First | 24.30 | 1248.0 | 88.50 | 5.0 | 7.43 | Maruti |
| 6010 | Delhi | 2013 | 33746 | Petrol | Manual | First | 18.50 | 1198.0 | 86.80 | 5.0 | 3.20 | Honda |
| 6014 | Delhi | 2014 | 27365 | Diesel | Manual | First | 28.40 | 1248.0 | 74.00 | 5.0 | 4.75 | Maruti |
765 rows × 12 columns
data_orig.isnull().sum() # before removing outliers
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 34
Power                 129
Seats                  39
New_price            6165
Price                1218
Manufacturer            0
dtype: int64
data.isnull().sum()
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
Price                0
Manufacturer         0
dtype: int64
What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?
There were a lot of missing values for price data (especially 'New_price').
Outliers and skewness present another challenge.
I spent too much time looking up MPGe (the MPG equivalent for electric cars) and finding mileage data for CNG (Compressed Natural Gas) cars to fill in those blanks, but when I ran the outlier removal, those rows dropped out anyway. Even with a wider outlier-removal threshold of 3xIQR (instead of 1.5), I believe every row with blanks happened to drop out.
The numerical values are also on vastly different scales (ex: 'Seats' << 'Year' << 'Kilometers_Driven').
The large number of unique car names (which also embed the manufacturer within that field) makes the column pointless to analyze as-is --> extracting the manufacturer helps, since it captures the brand (a key pricing factor) with far fewer unique values.
'Seats' was stored as a float instead of an int. I didn't think it would be an issue until I got down to the Decision Tree and saw that the root node had a break of "Seats <= 131.1", which obviously doesn't make any sense.
# Make column for price_log
data['price_log'] = np.log(data['Price'] + 1) # +1 guards against log(0) and keeps the transform invertible
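As a quick check (a minimal sketch; `np.log1p` and `np.expm1` are NumPy's built-in equivalents of this +1 pattern and its inverse):

```python
import numpy as np

# np.log1p(x) == np.log(x + 1), and np.expm1 is its exact inverse,
# so the transformed target can always be mapped back to prices.
prices = np.array([4.50, 17.50, 9.95])   # illustrative prices in INR lakhs
price_log = np.log1p(prices)
recovered = np.expm1(price_log)
print(np.allclose(recovered, prices))    # True
```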
Question:
To answer the question directly above:
'Name' has over two thousand unique values, making it impractical to analyze as a category.
We drop the target variable 'Price' from the predictors so the models can measure effectiveness (error, fit, etc.) against it as the dependent variable.
For 'Kilometers_Driven', I will run the models first with it unchanged and then compare against a log-transformed version (like 'Price'), since there's no obvious reason to drop it.
# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor
# To encode categorical variables
from sklearn.preprocessing import LabelEncoder
# For tuning the model
from sklearn.model_selection import GridSearchCV
# To check model performance
from sklearn.metrics import make_scorer,mean_squared_error, r2_score, mean_absolute_error
# Import libraries for building linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
import math # for reversing log transformation on 'Price'
# Split the data into train and test
X = data.drop(['Price', 'Engine', 'price_log'], axis=1) # also dropping 'Price' since we're using 'price_log' for model training instead
y = data['price_log']
# Encode categorical variables
X = pd.get_dummies(X, columns=['Location','Fuel_Type','Transmission','Owner_Type', 'Manufacturer'])
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size = 0.20, # sets 20% for test data & leaving 80% for training data
shuffle = True, # reorders the list when taking random samples
random_state = 1 # sets a seed to reproduce results while comparing different model runs
)
# Check splitting size results
print('training data = ', len(X_train), ' = ', (len(X_train)/len(X))*100, '%')
print('test data = ', len(X_test), ' = ', (len(X_test)/len(X))*100, '%')
print('total = ', len(X_train) + len(X_test))
print('pre-split length = ', len(X))
training data =  612  =  80.0 %
test data =  153  =  20.0 %
total =  765
pre-split length =  765
# Cast the input data to float64 (Python's "float" maps to the NumPy float64 type)
X_train = X_train.astype(float) # this was not needed on April 1, 2024, but is now required since statsmodels no longer accepts boolean columns
X_test = X_test.astype(float)
# Function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions) # measures explained variance in the target
n = predictors.shape[0] # number of observations in the dataset (= rows here, since no missing values)
k = predictors.shape[1] # number of independent variables in the model
return 1 - ((1 - r2) * (n - 1) / (n - k - 1)) # standard formula for calculating adjusted R-squared
# Function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
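A tiny worked check of the adjusted R-squared formula above, on made-up numbers (n and k are assumed for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.7, 11.3])

r2 = r2_score(y_true, y_pred)
n, k = len(y_true), 2            # pretend the model used 2 predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The adjustment always penalizes relative to plain R-squared (for k > 0, r2 < 1)
print(r2 > adj_r2)   # True
```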
# Function to compute different performance metrics to check regression model
def model_performance_regression(model, predictors, target):
'''
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
'''
pred = model.predict(predictors) # Predict using the independent variables
r2 = r2_score(target, pred)
adjr2 = adj_r2_score(predictors, target, pred)
rmse = np.sqrt(mean_squared_error(target, pred))
mae = mean_absolute_error(target, pred)
mape = mape_score(target, pred)
# Creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# Train the BASE model (no constant)
olsmodel = sm.OLS(y_train, X_train).fit()
# Get the model summary
print(olsmodel.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.927
Model: OLS Adj. R-squared: 0.922
Method: Least Squares F-statistic: 177.6
Date: Fri, 12 Apr 2024 Prob (F-statistic): 1.43e-295
Time: 04:13:21 Log-Likelihood: 222.73
No. Observations: 612 AIC: -361.5
Df Residuals: 570 BIC: -176.0
Df Model: 41
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Year 0.0934 0.004 22.246 0.000 0.085 0.102
Kilometers_Driven -1.92e-06 3.45e-07 -5.570 0.000 -2.6e-06 -1.24e-06
Mileage -0.0165 0.004 -3.968 0.000 -0.025 -0.008
Power 0.0088 0.000 19.015 0.000 0.008 0.010
Seats -0.0485 0.016 -2.970 0.003 -0.081 -0.016
Location_Ahmedabad -12.9347 0.589 -21.975 0.000 -14.091 -11.779
Location_Bangalore -12.8891 0.591 -21.795 0.000 -14.051 -11.728
Location_Chennai -12.9847 0.588 -22.077 0.000 -14.140 -11.829
Location_Coimbatore -12.9270 0.595 -21.743 0.000 -14.095 -11.759
Location_Delhi -13.0496 0.591 -22.074 0.000 -14.211 -11.888
Location_Hyderabad -12.9263 0.591 -21.888 0.000 -14.086 -11.766
Location_Jaipur -12.9946 0.591 -21.973 0.000 -14.156 -11.833
Location_Kochi -13.0166 0.596 -21.856 0.000 -14.186 -11.847
Location_Kolkata -13.1740 0.589 -22.366 0.000 -14.331 -12.017
Location_Mumbai -13.0124 0.589 -22.110 0.000 -14.168 -11.856
Location_Pune -13.0205 0.593 -21.941 0.000 -14.186 -11.855
Fuel_Type_CNG -47.6135 2.165 -21.990 0.000 -51.866 -43.361
Fuel_Type_Diesel -47.5064 2.168 -21.910 0.000 -51.765 -43.248
Fuel_Type_Petrol -47.8097 2.163 -22.100 0.000 -52.059 -43.561
Transmission_Automatic -71.4361 3.249 -21.984 0.000 -77.818 -65.054
Transmission_Manual -71.4935 3.247 -22.022 0.000 -77.870 -65.117
Owner_Type_First -47.6804 2.174 -21.930 0.000 -51.951 -43.410
Owner_Type_Second -47.6875 2.170 -21.973 0.000 -51.950 -43.425
Owner_Type_Third -47.5617 2.153 -22.088 0.000 -51.791 -43.332
Manufacturer_Audi -6.2520 0.297 -21.038 0.000 -6.836 -5.668
Manufacturer_BMW -6.3157 0.293 -21.533 0.000 -6.892 -5.740
Manufacturer_Datsun -6.7596 0.307 -22.004 0.000 -7.363 -6.156
Manufacturer_Fiat -6.9722 0.341 -20.425 0.000 -7.643 -6.302
Manufacturer_Ford -6.5625 0.300 -21.859 0.000 -7.152 -5.973
Manufacturer_Honda -6.4909 0.297 -21.868 0.000 -7.074 -5.908
Manufacturer_Hyundai -6.5693 0.298 -22.011 0.000 -7.155 -5.983
Manufacturer_Isuzu -6.5865 0.341 -19.320 0.000 -7.256 -5.917
Manufacturer_Jaguar -6.3692 0.323 -19.744 0.000 -7.003 -5.736
Manufacturer_Jeep -6.6290 0.306 -21.670 0.000 -7.230 -6.028
Manufacturer_Mahindra -6.5943 0.299 -22.025 0.000 -7.182 -6.006
Manufacturer_Maruti -6.4917 0.298 -21.821 0.000 -7.076 -5.907
Manufacturer_Mercedes-Benz -6.1169 0.301 -20.341 0.000 -6.708 -5.526
Manufacturer_Mini -5.8159 0.297 -19.576 0.000 -6.399 -5.232
Manufacturer_Mitsubishi -6.5283 0.306 -21.346 0.000 -7.129 -5.928
Manufacturer_Nissan -6.5768 0.297 -22.164 0.000 -7.160 -5.994
Manufacturer_Renault -6.6399 0.301 -22.088 0.000 -7.230 -6.049
Manufacturer_Skoda -6.6184 0.299 -22.103 0.000 -7.207 -6.030
Manufacturer_Tata -6.7346 0.302 -22.307 0.000 -7.328 -6.142
Manufacturer_Toyota -6.4240 0.297 -21.613 0.000 -7.008 -5.840
Manufacturer_Volkswagen -6.6463 0.296 -22.477 0.000 -7.227 -6.066
Manufacturer_Volvo -6.2357 0.345 -18.090 0.000 -6.913 -5.559
==============================================================================
Omnibus: 35.294 Durbin-Watson: 1.871
Prob(Omnibus): 0.000 Jarque-Bera (JB): 105.621
Skew: -0.184 Prob(JB): 1.16e-23
Kurtosis: 5.002 Cond. No. 1.22e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.02e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
print('Performance on ' + bold_text('TRAIN') + ' data: ')
lin_reg_train = model_performance_regression(olsmodel, X_train, y_train)
lin_reg_train
Performance on TRAIN data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.168153 | 0.128137 | 0.927406 | 0.921496 | 6.322593 |
print('Performance on ' + bold_text('TEST') + ' data: ')
# Make a variable for comparison later
lin_reg_test = model_performance_regression(olsmodel, X_test, y_test)
lin_reg_test
Performance on TEST data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.195618 | 0.137312 | 0.902631 | 0.860377 | 6.648424 |
TRAIN vs. TEST Observations:
The errors (RMSE, MAE & MAPE) all slightly increased,
while the fit (R-sq & Adj. R-sq) slightly decreased (which makes sense, since errors and fit are generally inversely related).
Although it's not a drastic difference, it could suggest slight overfitting. So let's consider the OLS Summary results.
from statsmodels.stats.outliers_influence import variance_inflation_factor
def checking_vif(train):
vif = pd.DataFrame()
vif["feature"] = train.columns
# Calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(train.values, i) for i in range(len(train.columns))
]
return vif
print(checking_vif(X_train))
                       feature       VIF
0                         Year  2.067078
1            Kilometers_Driven  1.963702
2                      Mileage  6.390740
3                        Power  7.204895
4                        Seats  3.239493
5           Location_Ahmedabad       inf
6           Location_Bangalore       inf
7             Location_Chennai       inf
8          Location_Coimbatore       inf
9               Location_Delhi       inf
10          Location_Hyderabad       inf
11             Location_Jaipur       inf
12              Location_Kochi       inf
13            Location_Kolkata       inf
14             Location_Mumbai       inf
15               Location_Pune       inf
16               Fuel_Type_CNG       inf
17            Fuel_Type_Diesel       inf
18            Fuel_Type_Petrol       inf
19      Transmission_Automatic       inf
20         Transmission_Manual       inf
21            Owner_Type_First       inf
22           Owner_Type_Second       inf
23            Owner_Type_Third       inf
24           Manufacturer_Audi       inf
25            Manufacturer_BMW       inf
26         Manufacturer_Datsun       inf
27           Manufacturer_Fiat       inf
28           Manufacturer_Ford       inf
29          Manufacturer_Honda       inf
30        Manufacturer_Hyundai       inf
31          Manufacturer_Isuzu       inf
32         Manufacturer_Jaguar       inf
33           Manufacturer_Jeep       inf
34       Manufacturer_Mahindra       inf
35         Manufacturer_Maruti       inf
36  Manufacturer_Mercedes-Benz       inf
37           Manufacturer_Mini       inf
38     Manufacturer_Mitsubishi       inf
39         Manufacturer_Nissan       inf
40        Manufacturer_Renault       inf
41          Manufacturer_Skoda       inf
42           Manufacturer_Tata       inf
43         Manufacturer_Toyota       inf
44     Manufacturer_Volkswagen       inf
45          Manufacturer_Volvo       inf
VIF Observations
We have two variables with scores > 5. In descending order: Power (~7.2) and Mileage (~6.4).
But both are valuable factors to consider when pricing & buying a used car, so dropping either does not fit the context.
The dummy variables all show infinite VIFs because every one-hot level was kept, so each dummy group sums to one and the columns are perfectly collinear (the same issue statsmodels flagged in Note [2] of the OLS summary).
We'll try regularization methods to address this, but it's unlikely to help much since the data lacks the traits which would benefit from these. Let's see...
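The infinite VIFs come from keeping every dummy level. A small sketch (toy data with assumed column values) shows how `pd.get_dummies(..., drop_first=True)` would restore full column rank:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Fuel_Type':    ['Petrol', 'Diesel', 'CNG', 'Petrol', 'Diesel', 'CNG'],
    'Transmission': ['Manual', 'Automatic', 'Manual', 'Automatic', 'Manual', 'Automatic'],
})

full    = pd.get_dummies(toy).astype(float)                   # every level kept
reduced = pd.get_dummies(toy, drop_first=True).astype(float)  # one reference level dropped per category

# Each full dummy group sums to 1, so the columns are linearly dependent
# (rank-deficient) -- exactly the condition that sends VIFs to infinity.
print(np.linalg.matrix_rank(full.values) < full.shape[1])        # True
print(np.linalg.matrix_rank(reduced.values) == reduced.shape[1]) # True
```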
# Run Ridge on most recent data (one round with constant)
ridge_model = Ridge()
ridge_model.fit(X_train, y_train)
Ridge()
# Running it on training data w/ default alpha=1
model_performance_regression(ridge_model, X_train, y_train)
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.169035 | 0.129223 | 0.926643 | 0.92067 | 6.379982 |
# On test data
ridge_reg = model_performance_regression(ridge_model, X_test, y_test)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.196574 | 0.13778 | 0.901678 | 0.85901 | 6.655618 |
L2 / Ridge Observations:
The errors (RMSE, MAE & MAPE) all had slight increases in both the train & test data.
Unsurprisingly, the fit for both sets (R-sq & Adj R-sq) had minor dips.
We'll see if L1 / Lasso has different results.
# Run Lasso on most recent data (one round with constant)
lasso_model = Lasso()
lasso_model.fit(X_train, y_train)
Lasso()
# Running it on training data w/ default alpha=1
model_performance_regression(lasso_model, X_train, y_train)
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.327777 | 0.245776 | 0.724166 | 0.701709 | 12.445443 |
# On test data
lasso_reg = model_performance_regression(lasso_model, X_test, y_test)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.312122 | 0.241016 | 0.752115 | 0.644543 | 11.885197 |
L1 / Lasso Observations:
The TRAINING errors nearly doubled, and the R-sq fits lost over 20% compared with the Ridge regression.
The significantly lower fits suggest that Lasso (at least with the default alpha) is not a good model for this data.
A likely reason: the default alpha = 1 is a strong L1 penalty for coefficients on this scale, so Lasso shrinks many of them all the way to zero and underfits.
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Ridge()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Ridge(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.5
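Beyond `best_params_`, the full per-alpha score table sits in `cv_results_`. A self-contained sketch on synthetic data (`X_demo`/`y_demo` are made up, not the notebook's data) shows how the winning alpha is chosen:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem with a known linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

folds = KFold(n_splits=10, shuffle=True, random_state=1)
cv = GridSearchCV(Ridge(), {'alpha': [0.001, 0.01, 0.1, 1, 10]},
                  scoring='r2', cv=folds, return_train_score=True)
cv.fit(X_demo, y_demo)

# best_params_ is simply the alpha with the highest mean CV score below
for alpha, score in zip(cv.cv_results_['param_alpha'], cv.cv_results_['mean_test_score']):
    print(alpha, round(score, 4))
print('best alpha:', cv.best_params_['alpha'])
```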
# First fit only to Training Data
ridge_model_tuned = Ridge(alpha = best_alpha) # creating Tuned Ridge Regression model using optimised alpha value
ridge_model_tuned.fit(X_train, y_train) # Fitting the data into the tuned model
Ridge(alpha=0.5)
# Check performance on training data
model_performance_regression(ridge_model_tuned, X_train, y_train) # performance metrics on training data
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.168483 | 0.128628 | 0.927121 | 0.921187 | 6.350315 |
# Performance check on test data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_test, y_test) # performance metrics on test data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.196089 | 0.137372 | 0.902162 | 0.859704 | 6.636593 |
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Lasso()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Lasso(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.001
# First fit only to Training Data
lasso_model_tuned = Lasso(alpha = best_alpha) # creating Tuned Lasso Regression model using optimized alpha value
lasso_model_tuned.fit(X_train, y_train) # Fitting the data into the tuned model
Lasso(alpha=0.001)
# Check performance on training data
model_performance_regression(lasso_model_tuned, X_train, y_train) #getting performance metrics on training data
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.172044 | 0.131336 | 0.924008 | 0.917821 | 6.507852 |
# Performance check on test data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_test, y_test) #getting performance metrics on test data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.200495 | 0.142307 | 0.897716 | 0.853328 | 6.820849 |
As expected, although tuning improved Lasso's own results, it still did not outperform the original model.
On to other methods...
Elastic Net likely won't help either, since it simply blends the L1 and L2 penalties, and neither one improved on plain OLS here.
elasticnet_model = ElasticNet()
elasticnet_model.fit(X_train, y_train)
ElasticNet()
model_performance_regression(elasticnet_model, X_train, y_train)
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.327097 | 0.244417 | 0.72531 | 0.702945 | 12.360994 |
elasticnet_reg = model_performance_regression(elasticnet_model, X_test, y_test)
elasticnet_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.312763 | 0.240042 | 0.751096 | 0.643081 | 11.820109 |
Now try optimizing Elastic Net
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]}
model = ElasticNet()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=ElasticNet(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
                         'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimized parameters for alpha
best_params
{'alpha': 0.001, 'l1_ratio': 0.05}
best_alpha = best_params['alpha']
best_ratio = best_params['l1_ratio']
elasticnet_model_tuned = ElasticNet(alpha = best_alpha, l1_ratio = best_ratio)
elasticnet_model_tuned.fit(X_train, y_train)
ElasticNet(alpha=0.001, l1_ratio=0.05)
model_performance_regression(elasticnet_model_tuned, X_train, y_train)
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.168718 | 0.128935 | 0.926917 | 0.920967 | 6.366589 |
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_test, y_test)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.196586 | 0.137997 | 0.901666 | 0.858993 | 6.658785 |
models = pd.concat([lin_reg_test,
ridge_reg, ridge_reg_tuned,
lasso_reg, lasso_reg_tuned,
elasticnet_reg, elasticnet_reg_tuned],
axis=0
) #combining all models into a single dataframe
models['Models'] = ['Least Squares', 'Ridge Regression', 'Ridge Regression Tuned', 'Lasso Regression',
'Lasso Regression Tuned', 'Elastic Net Regression',
'Elastic Net Regression Tuned'] #adding names of the models as a column to the dataframe
models = models.iloc[:,[5, 0,1,2,3,4]] #ordering names of the models (column 5) as the first column
models
| | Models | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|---|
| 0 | Least Squares | 0.195618 | 0.137312 | 0.902631 | 0.860377 | 6.648424 |
| 0 | Ridge Regression | 0.196574 | 0.137780 | 0.901678 | 0.859010 | 6.655618 |
| 0 | Ridge Regression Tuned | 0.196089 | 0.137372 | 0.902162 | 0.859704 | 6.636593 |
| 0 | Lasso Regression | 0.312122 | 0.241016 | 0.752115 | 0.644543 | 11.885197 |
| 0 | Lasso Regression Tuned | 0.200495 | 0.142307 | 0.897716 | 0.853328 | 6.820849 |
| 0 | Elastic Net Regression | 0.312763 | 0.240042 | 0.751096 | 0.643081 | 11.820109 |
| 0 | Elastic Net Regression Tuned | 0.196586 | 0.137997 | 0.901666 | 0.858993 | 6.658785 |
# Collect RMSE & MAE values for all seven models on one line graph
plt.plot(models['Models'], models['RMSE'], 'o-', color='lightblue')
plt.plot(models['Models'], models['MAE'], 'o-', color='blue')
plt.title('RMSE and MAE for Linear Regression Models')
plt.xlabel('Models')
plt.xticks(rotation = 60)
plt.ylabel('Error')
plt.legend(['RMSE', 'MAE'])
plt.show()
# Collect R-squared & Adj. R-squared values for all seven models on one line graph
plt.figure(figsize=(8, 3))
plt.plot(models['Models'], models['R-squared'], 'o-', color='green')
plt.plot(models['Models'], models['Adj. R-squared'], 'o-', color='lightgreen')
plt.title('R-squared and Adj. R-squared for Linear Regression Models')
plt.xlabel('Models')
plt.xticks(rotation = 60)
plt.ylabel('Fit')
plt.legend(['R-squared', 'Adj. R-squared'])
plt.show()
There are two main points to keep in mind when interpreting what these results mean:
The target price had a log transformation, so the error metrics (RMSE & MAE) need that reversed in order to be interpreted in price terms... but NOT the R-sq, Adj. R-sq, or MAPE, which are unitless / percentage measures.
In particular, the MAPE is a percentage, so we'll clarify what that means regarding the actual target prices. Here, we see that the best model (OLS) has about a 6.65% error from predicted to actual price results & the worst (unoptimized Lasso) is off about 11.89%.
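To make the log-scale errors concrete (a sketch, using the train RMSE of ~0.168 from above and an illustrative price):

```python
import numpy as np

# An error of e on the log scale means predictions are off by a
# multiplicative factor of exp(e) on the price scale.
log_rmse = 0.168                      # train RMSE from the OLS model above
factor = np.exp(log_rmse)             # ~1.18x
actual = 10.0                         # illustrative price in INR lakhs
print(round(actual * factor, 2))      # 11.83 (overprediction by one RMSE)
print(round(actual / factor, 2))      # 8.45 (underprediction by one RMSE)
```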
# Apply the exponential function to errors individually (for clarity in comparison)
# Training Data
rmse1 = math.exp(lin_reg_train['RMSE'])
mae1 = math.exp(lin_reg_train['MAE'])
# Test Data
rmse2 = math.exp(lin_reg_test['RMSE'])
mae2 = math.exp(lin_reg_test['MAE'])
# Print the results
print(bold_text("TRAIN:"))
print("RMSE:", round(rmse1, 3))
print("MAE:", round(mae1, 3))
print('-' * 12) # print separator line
print(bold_text("TEST:"))
print("RMSE:", round(rmse2, 3))
print("MAE:", round(mae2, 3))
TRAIN:
RMSE: 1.183
MAE: 1.137
------------
TEST:
RMSE: 1.216
MAE: 1.147
# Convert the above to INR & USD
# Training INR & USD
inr_error1 = rmse1 * 100000 # converting based on the price variable definition
inr_error2 = mae1 * 100000
usd_error1 = inr_error1 * 0.012 # use currency conversion of 1 INR = 0.012 USD
usd_error2 = inr_error2 * 0.012
# Test INR & USD
inr_error3 = rmse2 * 100000
inr_error4 = mae2 * 100000
usd_error3 = inr_error3 * 0.012
usd_error4 = inr_error4 * 0.012
# Format the amounts for money
#Training
inr_error1_formatted = '₹{:,.2f}'.format(inr_error1)
inr_error2_formatted = '₹{:,.2f}'.format(inr_error2)
usd_error1_formatted = '${:,.2f}'.format(usd_error1)
usd_error2_formatted = '${:,.2f}'.format(usd_error2)
#Test
inr_error3_formatted = '₹{:,.2f}'.format(inr_error3)
inr_error4_formatted = '₹{:,.2f}'.format(inr_error4)
usd_error3_formatted = '${:,.2f}'.format(usd_error3)
usd_error4_formatted = '${:,.2f}'.format(usd_error4)
##### PRINT the converted results
print(bold_text("TRAIN:"))
print("RMSE (INR): ", inr_error1_formatted)
print("MAE (INR): ", inr_error2_formatted)
print('-' * 10)
print("RMSE (USD): ", usd_error1_formatted)
print("MAE (USD): ", usd_error2_formatted)
print('\n', '=' * 30, '\n') # print separator line
print(bold_text("TEST:"))
print("RMSE (INR): ", inr_error3_formatted)
print("MAE (INR): ", inr_error4_formatted)
print('-' * 10)
print("RMSE (USD): ", usd_error3_formatted)
print("MAE (USD): ", usd_error4_formatted)
TRAIN:
RMSE (INR):  ₹118,311.79
MAE (INR):  ₹113,670.88
----------
RMSE (USD):  $1,419.74
MAE (USD):  $1,364.05

==============================

TEST:
RMSE (INR):  ₹121,606.25
MAE (INR):  ₹114,718.61
----------
RMSE (USD):  $1,459.27
MAE (USD):  $1,376.62
Error Conversion Observation (pt. 1 - raw money)
Keep in mind that exponentiating a log-scale error actually gives a multiplicative factor (~1.18x here), so reading these rupee figures as flat errors is only a ballpark interpretation.
# INR mean price
inr_mean_price = data['Price'].mean() * 100000
inr_mean_price_formatted = '₹{:,.2f}'.format(inr_mean_price)
print('Average ', bold_text('INR Price: '), inr_mean_price_formatted)
# USD mean price
usd_mean_price = inr_mean_price * 0.012
usd_mean_price_formatted = '${:,.2f}'.format(usd_mean_price)
print('Average ', bold_text('USD Price: '), usd_mean_price_formatted)
Average INR Price: ₹1,058,015.69
Average USD Price: $12,696.19
# Percentage of error off mean
inr_mean_error_percent = 120000 / inr_mean_price * 100 # Just using 120,000 as a ballpark based on the above results
inr_mean_error_percent_rounded = round(inr_mean_error_percent, 2)
usd_mean_error_percent = 1400 / usd_mean_price * 100 # Similarly with 1,400
usd_mean_error_percent_rounded = round(usd_mean_error_percent, 2)
print('INR Error is about', bold_text(str(inr_mean_error_percent_rounded)+'%'), 'off the mean')
print('USD Error is about', bold_text(str(usd_mean_error_percent_rounded)+'%'), 'off the mean')
INR Error is about 11.34% off the mean
USD Error is about 11.03% off the mean
Error Conversion Observation (pt. 2 - percentage of the mean)
This approximates to only about 11% off the mean price in both currencies (the difference is due to some manual rounding).
It might seem unnecessary to calculate the error percentage off the mean since we already had the MAPE, but that was a more specific percentage off predicted/actual values... I was just curious to see how that error compared to the actual mean.
Models Comparisons:
As we can see from the comparison summary, Least Squares performed the best (lowest errors & best fits).
To interpret the results properly, we'll apply "math.exp" to the two error metrics (RMSE & MAE) to reverse the log transformation.
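An alternative worth noting (an assumed approach, not what this notebook does): invert the transform on the predictions themselves, then compute errors directly in rupee space:

```python
import numpy as np

# Illustrative log-scale targets and predictions (made-up values)
y_test_log = np.array([1.70, 2.92, 2.39])
y_pred_log = np.array([1.60, 3.00, 2.30])

# Undo log1p, then scale by the INR-100,000 price unit
y_test_inr = np.expm1(y_test_log) * 100_000
y_pred_inr = np.expm1(y_pred_log) * 100_000

rmse_inr = np.sqrt(np.mean((y_test_inr - y_pred_inr) ** 2))
mae_inr  = np.mean(np.abs(y_test_inr - y_pred_inr))
print(round(rmse_inr), round(mae_inr))   # RMSE >= MAE always holds
```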
data_log = data.copy() # making a copy so I can return to "data" if this log transformation doesn't help
data_log['km_log'] = np.log(data_log['Kilometers_Driven'] + 1)
# Split the data BUT with an extra "_log" to differentiate between our previous model
X_log = data_log.drop(['Price', 'Engine', 'Kilometers_Driven', 'price_log'], axis=1) # 'Kilometers_Driven' => 'km_log'
y = data_log['price_log'] # y doesn't need to change since it's still the same dependent variable
# Encode categorical variables
X_log = pd.get_dummies(X_log, columns=['Location','Fuel_Type','Transmission','Owner_Type', 'Manufacturer'])
# Split the data into train and test
X_log_train, X_log_test, y_train, y_test = train_test_split(
X_log, y,
test_size = 0.20, # sets 20% for test data & leaving 80% for training data
shuffle = True, # reorders the list when taking random samples
random_state = 1 # sets a seed to reproduce results while comparing different model runs
)
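As a side note, `np.log(x + 1)` (the `+ 1` guards against zero-kilometer entries) is exactly what `np.log1p` computes, and `np.expm1` inverts it. A small sketch with illustrative odometer readings:

```python
import numpy as np

km = np.array([0.0, 35000.0, 72000.0])   # illustrative odometer readings, including a zero

km_log = np.log1p(km)                    # same as np.log(km + 1), but numerically safer near 0
km_back = np.expm1(km_log)               # exact inverse transform back to kilometers

print(np.allclose(km_log, np.log(km + 1)), np.allclose(km_back, km))
```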
import statsmodels.api as sm
# Reusing the names "X_train2" & "X_test2" as copies of the "X_log_" versions,
# matching the names used before this no-constant linear model group was copied in
X_train2 = X_log_train
# Cast the boolean dummy columns to np.float64 so statsmodels can handle them
X_train2 = X_train2.astype(np.float64)
X_test2 = X_log_test.astype(np.float64)  # cast the test set the same way
# Train the model
olsmodel2 = sm.OLS(y_train, X_train2).fit()
# Get the model summary
olsmodel2.summary()
print(olsmodel2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.925
Model: OLS Adj. R-squared: 0.920
Method: Least Squares F-statistic: 172.4
Date: Fri, 12 Apr 2024 Prob (F-statistic): 3.71e-292
Time: 04:13:40 Log-Likelihood: 214.24
No. Observations: 612 AIC: -344.5
Df Residuals: 570 BIC: -159.0
Df Model: 41
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Year 0.0979 0.004 23.620 0.000 0.090 0.106
Mileage -0.0170 0.004 -4.013 0.000 -0.025 -0.009
Power 0.0087 0.000 18.687 0.000 0.008 0.010
Seats -0.0505 0.017 -3.052 0.002 -0.083 -0.018
km_log -0.0412 0.011 -3.819 0.000 -0.062 -0.020
Location_Ahmedabad -13.5467 0.585 -23.154 0.000 -14.696 -12.398
Location_Bangalore -13.5134 0.587 -23.023 0.000 -14.666 -12.361
Location_Chennai -13.6001 0.584 -23.283 0.000 -14.747 -12.453
Location_Coimbatore -13.5404 0.591 -22.895 0.000 -14.702 -12.379
Location_Delhi -13.6604 0.588 -23.234 0.000 -14.815 -12.506
Location_Hyderabad -13.5546 0.586 -23.149 0.000 -14.705 -12.405
Location_Jaipur -13.6074 0.588 -23.135 0.000 -14.763 -12.452
Location_Kochi -13.6312 0.592 -23.014 0.000 -14.795 -12.468
Location_Kolkata -13.7847 0.585 -23.565 0.000 -14.934 -12.636
Location_Mumbai -13.6246 0.584 -23.312 0.000 -14.773 -12.477
Location_Pune -13.6468 0.588 -23.192 0.000 -14.803 -12.491
Fuel_Type_CNG -49.8723 2.152 -23.180 0.000 -54.098 -45.646
Fuel_Type_Diesel -49.7703 2.154 -23.109 0.000 -54.000 -45.540
Fuel_Type_Petrol -50.0677 2.148 -23.305 0.000 -54.287 -45.848
Transmission_Automatic -74.8255 3.228 -23.181 0.000 -81.165 -68.486
Transmission_Manual -74.8848 3.225 -23.221 0.000 -81.219 -68.551
Owner_Type_First -49.9117 2.163 -23.076 0.000 -54.160 -45.663
Owner_Type_Second -49.9223 2.159 -23.126 0.000 -54.162 -45.682
Owner_Type_Third -49.8763 2.133 -23.383 0.000 -54.066 -45.687
Manufacturer_Audi -6.5544 0.296 -22.156 0.000 -7.135 -5.973
Manufacturer_BMW -6.6110 0.292 -22.607 0.000 -7.185 -6.037
Manufacturer_Datsun -7.0583 0.307 -23.000 0.000 -7.661 -6.456
Manufacturer_Fiat -7.3114 0.338 -21.622 0.000 -7.976 -6.647
Manufacturer_Ford -6.8682 0.299 -23.004 0.000 -7.455 -6.282
Manufacturer_Honda -6.7947 0.296 -22.993 0.000 -7.375 -6.214
Manufacturer_Hyundai -6.8863 0.296 -23.263 0.000 -7.468 -6.305
Manufacturer_Isuzu -6.8835 0.341 -20.185 0.000 -7.553 -6.214
Manufacturer_Jaguar -6.6830 0.320 -20.902 0.000 -7.311 -6.055
Manufacturer_Jeep -6.9335 0.304 -22.786 0.000 -7.531 -6.336
Manufacturer_Mahindra -6.9214 0.296 -23.406 0.000 -7.502 -6.341
Manufacturer_Maruti -6.7986 0.296 -22.961 0.000 -7.380 -6.217
Manufacturer_Mercedes-Benz -6.4203 0.299 -21.449 0.000 -7.008 -5.832
Manufacturer_Mini -6.1088 0.296 -20.617 0.000 -6.691 -5.527
Manufacturer_Mitsubishi -6.8509 0.303 -22.645 0.000 -7.445 -6.257
Manufacturer_Nissan -6.8805 0.296 -23.256 0.000 -7.462 -6.299
Manufacturer_Renault -6.9559 0.298 -23.334 0.000 -7.541 -6.370
Manufacturer_Skoda -6.9285 0.298 -23.258 0.000 -7.514 -6.343
Manufacturer_Tata -7.0505 0.300 -23.510 0.000 -7.640 -6.461
Manufacturer_Toyota -6.7406 0.294 -22.892 0.000 -7.319 -6.162
Manufacturer_Volkswagen -6.9490 0.294 -23.601 0.000 -7.527 -6.371
Manufacturer_Volvo -6.5210 0.347 -18.802 0.000 -7.202 -5.840
==============================================================================
Omnibus: 37.927 Durbin-Watson: 1.887
Prob(Omnibus): 0.000 Jarque-Bera (JB): 100.909
Skew: -0.276 Prob(JB): 1.22e-22
Kurtosis: 4.911 Cond. No. 1.00e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.48e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
print('Performance on ' + bold_text('TRAIN') + ' data: ')
lin_reg_train = model_performance_regression(olsmodel2, X_train2, y_train)
lin_reg_train
Performance on TRAIN data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.170502 | 0.130416 | 0.925364 | 0.919288 | 6.471896 |
print('Performance on ' + bold_text('TEST') + ' data: ')
lin_reg_test = model_performance_regression(olsmodel2, X_test2, y_test)
lin_reg_test
Performance on TEST data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.198575 | 0.136468 | 0.899666 | 0.856124 | 6.635891 |
from statsmodels.stats.outliers_influence import variance_inflation_factor
def checking_vif(train):
vif = pd.DataFrame()
vif["feature"] = train.columns
# Calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(train.values, i) for i in range(len(train.columns))
]
return vif
print(checking_vif(X_train2))
   feature                 VIF
0  Year               1.961651
1  Mileage            6.430930
2  Power              7.203300
3  Seats              3.236457
4  km_log             1.779721
5-45  all one-hot dummy columns (Location_*, Fuel_Type_*,
      Transmission_*, Owner_Type_*, Manufacturer_*)    inf
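The inf VIFs on every dummy column are expected here: `pd.get_dummies` kept all levels of each categorical, so each dummy group sums to one and the groups are exactly linearly dependent. A toy sketch (illustrative data, not the car dataset) shows the rank deficiency, and how `drop_first=True` would avoid it:

```python
import numpy as np
import pandas as pd

# Two categoricals, both fully one-hot encoded
df = pd.DataFrame({
    'Transmission': ['Manual', 'Automatic', 'Manual', 'Automatic'],
    'Fuel_Type':    ['Petrol', 'Diesel', 'Diesel', 'Petrol'],
})

full = pd.get_dummies(df).astype(float)                      # all levels kept
reduced = pd.get_dummies(df, drop_first=True).astype(float)  # one reference level dropped per group

# Each dummy group sums to 1, so the full design matrix loses a rank
print(np.linalg.matrix_rank(full.values), 'of', full.shape[1], 'columns independent')
print(np.linalg.matrix_rank(reduced.values), 'of', reduced.shape[1], 'columns independent')
```

With a reference level dropped, each remaining dummy's coefficient is interpreted relative to that level, and the VIFs become finite.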
# Run Ridge on most recent data (3rd round)
ridge_model = Ridge()
ridge_model.fit(X_train2, y_train)
Ridge()
# Running it on training data w/ default alpha=1
ridge_reg = model_performance_regression(ridge_model, X_train2, y_train)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.171429 | 0.131601 | 0.92455 | 0.918407 | 6.533418 |
# On test data
ridge_reg = model_performance_regression(ridge_model, X_test2, y_test)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.199574 | 0.1369 | 0.898654 | 0.854674 | 6.633909 |
# Run Lasso on most recent data (3rd round)
lasso_model = Lasso()
lasso_model.fit(X_train2, y_train)
Lasso()
# Running it on training data w/ default alpha=1
lasso_reg = model_performance_regression(lasso_model, X_train2, y_train)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.358635 | 0.260463 | 0.669786 | 0.642901 | 13.343935 |
# On test data
lasso_reg = model_performance_regression(lasso_model, X_test2, y_test)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.350243 | 0.269687 | 0.687868 | 0.552414 | 13.09274 |
from sklearn.linear_model import ElasticNet  # ElasticNet wasn't in the earlier imports

elasticnet_model = ElasticNet()
elasticnet_model.fit(X_train2, y_train)
ElasticNet()
elasticnet_reg = model_performance_regression(elasticnet_model, X_test2, y_test)
elasticnet_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.329328 | 0.249903 | 0.724032 | 0.604273 | 12.048862 |
# Implementing KFold with 10 subsets of the most recent training data
from sklearn.model_selection import GridSearchCV  # needed for the grid searches below
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Ridge()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train2, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Ridge(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.5
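Beyond `best_params_`, `GridSearchCV.cv_results_` records how close the candidate alphas actually were. A self-contained sketch on synthetic data (so it runs independently of our notebook variables):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data, purely for illustration
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

folds_demo = KFold(n_splits=5, shuffle=True, random_state=1)
grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10]},
                    scoring='r2', cv=folds_demo, return_train_score=True)
grid.fit(X_demo, y_demo)

# One row per alpha, with mean train/validation R-squared
cv_table = pd.DataFrame(grid.cv_results_)[['param_alpha', 'mean_train_score', 'mean_test_score']]
print(cv_table.sort_values('mean_test_score', ascending=False))
```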
# First fit only to Training Data
ridge_model_tuned = Ridge(alpha = best_alpha) # creating Tuned Ridge Regression model using optimised alpha value
ridge_model_tuned.fit(X_train2, y_train) # Fitting the data into the tuned model
Ridge(alpha=0.5)
# Performance check on training data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_train2, y_train) # performance metrics on training data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.170853 | 0.131028 | 0.925056 | 0.918955 | 6.505385 |
# Performance check on test data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_test2, y_test) # performance metrics on test data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.199065 | 0.136567 | 0.89917 | 0.855414 | 6.621545 |
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Lasso()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train2, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Lasso(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.001
# First fit only to Training Data
lasso_model_tuned = Lasso(alpha = best_alpha) # creating Tuned Lasso Regression model using optimized alpha value
lasso_model_tuned.fit(X_train2, y_train) # Fitting the data into the tuned model
Lasso(alpha=0.001)
# Check performance on training data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_train2, y_train) #getting performance metrics on training data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.174365 | 0.133523 | 0.921943 | 0.915588 | 6.639703 |
# Performance check on test data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_test2, y_test) #getting performance metrics on test data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.204342 | 0.141808 | 0.893753 | 0.847646 | 6.812168 |
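Lasso's distinguishing behavior is setting coefficients exactly to zero, and at larger alphas many coefficients vanish, which is consistent with the big performance drop we saw at the default alpha = 1. You can count the exact zeros in `coef_`; a sketch on synthetic data where only one of ten features matters:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only feature 0 is informative
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
n_zero = int(np.sum(lasso_demo.coef_ == 0.0))   # count coefficients zeroed out by the L1 penalty
print(n_zero, 'of 10 coefficients set exactly to zero')
```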
elasticnet_model = ElasticNet()
elasticnet_model.fit(X_train2, y_train)
ElasticNet()
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]}
model = ElasticNet()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train2, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=ElasticNet(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
                         'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimized parameters for alpha
best_params
{'alpha': 0.001, 'l1_ratio': 0.05}
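An `l1_ratio` of 0.05 means the penalty is 95% L2 / 5% L1, so the tuned ElasticNet should behave almost like Ridge. A sketch on synthetic data, using the (assumed) correspondence that sklearn's ElasticNet L2 term matches a Ridge alpha of roughly `n_samples * alpha * (1 - l1_ratio)`:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.2, size=200)

enet = ElasticNet(alpha=0.001, l1_ratio=0.05).fit(X_demo, y_demo)
ridge = Ridge(alpha=200 * 0.001 * 0.95).fit(X_demo, y_demo)  # comparable L2 strength

# With the L1 share this small, the coefficient vectors nearly coincide
print(np.max(np.abs(enet.coef_ - ridge.coef_)))
```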
best_alpha = best_params['alpha']
best_ratio = best_params['l1_ratio']
elasticnet_model_tuned = ElasticNet(alpha = best_alpha, l1_ratio = best_ratio)
elasticnet_model_tuned.fit(X_train2, y_train)
ElasticNet(alpha=0.001, l1_ratio=0.05)
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_train2, y_train)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.171089 | 0.131275 | 0.924849 | 0.91873 | 6.518294 |
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_test2, y_test)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.199576 | 0.137137 | 0.898652 | 0.85467 | 6.640268 |
models= pd.concat([lin_reg_test,
ridge_reg, ridge_reg_tuned,
lasso_reg, lasso_reg_tuned,
elasticnet_reg, elasticnet_reg_tuned],
axis=0
) #combining all models into a single dataframe
models['Models'] = ['Least Squares', 'Ridge Regression', 'Ridge Regression Tuned', 'Lasso Regression',
'Lasso Regression Tuned', 'Elastic Net Regression',
'Elastic Net Regression Tuned'] #adding names of the models as a column to the dataframe
models = models.iloc[:,[5, 0,1,2,3,4]] #ordering names of the models as the first column
models
| | Models | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|---|
| 0 | Least Squares | 0.198575 | 0.136468 | 0.899666 | 0.856124 | 6.635891 |
| 0 | Ridge Regression | 0.199574 | 0.136900 | 0.898654 | 0.854674 | 6.633909 |
| 0 | Ridge Regression Tuned | 0.199065 | 0.136567 | 0.899170 | 0.855414 | 6.621545 |
| 0 | Lasso Regression | 0.350243 | 0.269687 | 0.687868 | 0.552414 | 13.092740 |
| 0 | Lasso Regression Tuned | 0.204342 | 0.141808 | 0.893753 | 0.847646 | 6.812168 |
| 0 | Elastic Net Regression | 0.329328 | 0.249903 | 0.724032 | 0.604273 | 12.048862 |
| 0 | Elastic Net Regression Tuned | 0.199576 | 0.137137 | 0.898652 | 0.854670 | 6.640268 |
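For readability, a combined frame like this can be ranked by test RMSE. A miniature sketch using three of the rows above (the `Models`/`RMSE` values are copied from the table):

```python
import pandas as pd

# Miniature stand-in for the full `models` comparison frame
demo = pd.DataFrame({
    'Models': ['Least Squares', 'Ridge Regression Tuned', 'Lasso Regression'],
    'RMSE':   [0.198575, 0.199065, 0.350243],
})

ranked = demo.sort_values('RMSE').reset_index(drop=True)   # lower RMSE ranks first
print(ranked)
```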
# Using the original OLS model ("olsmodel", fit on "X_train") since that yielded the best performance metrics
y_pred_log = olsmodel.predict(X_train)  # predictions on the log(price) scale
y_train_predicted = olsmodel.predict(X_train)
y_test_predicted = olsmodel.predict(X_test)
y_pred = np.exp(y_pred_log)  # back-transform to the original price scale
y_actual = data['Price']
# Residuals
residuals_train = y_train - y_train_predicted
residuals_test = y_test - y_test_predicted
residuals_train.mean()
1.3311628487952912e-13
residuals_test.mean()
-0.01936793395842632
Both residual means are very close to 0 (especially for the training data).
Residual mean ≈ 0: Check!
# Plot histogram of residuals
sns.histplot(residuals_train, kde=True);
# Plot histogram of residuals
sns.histplot(residuals_test, kde=True);
Both plots look approximately normally distributed (apart from a small bump off to one side in the test residuals).
Error normality: Check!
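The histograms can be complemented with a formal normality test such as Shapiro-Wilk. This sketch runs it on synthetic stand-in residuals (swap in `residuals_train` to check the real ones):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for residuals_train, purely illustrative
rng = np.random.default_rng(1)
demo_resid = rng.normal(scale=0.17, size=500)

stat, p = stats.shapiro(demo_resid)
# A large p-value (> 0.05) means no evidence against normality; a small one rejects it
print(f'W = {stat:.4f}, p = {p:.4f}')
```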
# Predicted values
fitted = olsmodel.predict(X_train)
# Plotting Residuals VS Fitted Values
sns.residplot(x = fitted, y = residuals_train, color="lightblue")
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
# Predicted values
fitted = olsmodel.predict(X_test)
# Plotting Residuals VS Fitted Values
sns.residplot(x = fitted, y = residuals_test, color="lightblue")
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals_train, X_train)
lzip(name, test)
[('F statistic', 1.3210813330703655), ('p-value', 0.01174257770805736)]
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals_test, X_test)
lzip(name, test)
[('F statistic', 0.5453798908479777), ('p-value', 0.9709654282129128)]
import statsmodels.api as sm
# statsmodels' OLS does not add a constant by default, so we add it explicitly.
# The constant keeps the regression from being forced through the origin.
X_train1 = sm.add_constant(X_train)
# Add constant to test data
X_test1 = sm.add_constant(X_test)
# Train the model
olsmodel1 = sm.OLS(y_train, X_train1).fit()
# Get the model summary
olsmodel1.summary()
print(olsmodel1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.927
Model: OLS Adj. R-squared: 0.922
Method: Least Squares F-statistic: 177.6
Date: Fri, 12 Apr 2024 Prob (F-statistic): 1.43e-295
Time: 04:13:57 Log-Likelihood: 222.73
No. Observations: 612 AIC: -361.5
Df Residuals: 570 BIC: -176.0
Df Model: 41
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -80.8681 3.675 -22.003 0.000 -88.087 -73.649
Year 0.0934 0.004 22.246 0.000 0.085 0.102
Kilometers_Driven -1.92e-06 3.45e-07 -5.570 0.000 -2.6e-06 -1.24e-06
Mileage -0.0165 0.004 -3.968 0.000 -0.025 -0.008
Power 0.0088 0.000 19.015 0.000 0.008 0.010
Seats -0.0485 0.016 -2.970 0.003 -0.081 -0.016
Location_Ahmedabad -7.2927 0.333 -21.909 0.000 -7.947 -6.639
Location_Bangalore -7.2472 0.336 -21.586 0.000 -7.907 -6.588
Location_Chennai -7.3427 0.332 -22.094 0.000 -7.995 -6.690
Location_Coimbatore -7.2851 0.338 -21.532 0.000 -7.950 -6.621
Location_Delhi -7.4077 0.335 -22.093 0.000 -8.066 -6.749
Location_Hyderabad -7.2844 0.334 -21.779 0.000 -7.941 -6.627
Location_Jaipur -7.3526 0.336 -21.912 0.000 -8.012 -6.694
Location_Kochi -7.3747 0.339 -21.732 0.000 -8.041 -6.708
Location_Kolkata -7.5321 0.333 -22.613 0.000 -8.186 -6.878
Location_Mumbai -7.3704 0.332 -22.178 0.000 -8.023 -6.718
Location_Pune -7.3785 0.337 -21.869 0.000 -8.041 -6.716
Fuel_Type_CNG -26.9264 1.225 -21.973 0.000 -29.333 -24.519
Fuel_Type_Diesel -26.8192 1.228 -21.837 0.000 -29.231 -24.407
Fuel_Type_Petrol -27.1226 1.223 -22.172 0.000 -29.525 -24.720
Transmission_Automatic -40.4053 1.839 -21.970 0.000 -44.018 -36.793
Transmission_Manual -40.4628 1.836 -22.036 0.000 -44.069 -36.856
Owner_Type_First -26.9932 1.234 -21.870 0.000 -29.417 -24.569
Owner_Type_Second -27.0004 1.230 -21.944 0.000 -29.417 -24.584
Owner_Type_Third -26.8745 1.214 -22.136 0.000 -29.259 -24.490
Manufacturer_Audi -3.4311 0.172 -19.917 0.000 -3.769 -3.093
Manufacturer_BMW -3.4947 0.168 -20.817 0.000 -3.824 -3.165
Manufacturer_Datsun -3.9386 0.186 -21.190 0.000 -4.304 -3.574
Manufacturer_Fiat -4.1512 0.240 -17.276 0.000 -4.623 -3.679
Manufacturer_Ford -3.7415 0.174 -21.478 0.000 -4.084 -3.399
Manufacturer_Honda -3.6699 0.170 -21.599 0.000 -4.004 -3.336
Manufacturer_Hyundai -3.7483 0.171 -21.888 0.000 -4.085 -3.412
Manufacturer_Isuzu -3.7655 0.239 -15.762 0.000 -4.235 -3.296
Manufacturer_Jaguar -3.5482 0.210 -16.877 0.000 -3.961 -3.135
Manufacturer_Jeep -3.8080 0.181 -21.006 0.000 -4.164 -3.452
Manufacturer_Mahindra -3.7733 0.175 -21.621 0.000 -4.116 -3.431
Manufacturer_Maruti -3.6708 0.171 -21.527 0.000 -4.006 -3.336
Manufacturer_Mercedes-Benz -3.2959 0.177 -18.659 0.000 -3.643 -2.949
Manufacturer_Mini -2.9949 0.173 -17.291 0.000 -3.335 -2.655
Manufacturer_Mitsubishi -3.7073 0.187 -19.831 0.000 -4.074 -3.340
Manufacturer_Nissan -3.7558 0.173 -21.719 0.000 -4.095 -3.416
Manufacturer_Renault -3.8190 0.175 -21.859 0.000 -4.162 -3.476
Manufacturer_Skoda -3.7974 0.174 -21.822 0.000 -4.139 -3.456
Manufacturer_Tata -3.9136 0.175 -22.337 0.000 -4.258 -3.570
Manufacturer_Toyota -3.6030 0.170 -21.148 0.000 -3.938 -3.268
Manufacturer_Volkswagen -3.8253 0.169 -22.624 0.000 -4.157 -3.493
Manufacturer_Volvo -3.4147 0.243 -14.053 0.000 -3.892 -2.937
==============================================================================
Omnibus: 35.294 Durbin-Watson: 1.871
Prob(Omnibus): 0.000 Jarque-Bera (JB): 105.621
Skew: -0.184 Prob(JB): 1.16e-23
Kurtosis: 5.002 Cond. No. 1.22e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.02e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
print('Performance on ' + bold_text('TRAIN') + ' data: ')
lin_reg_train = model_performance_regression(olsmodel1, X_train1, y_train)
lin_reg_train
Performance on TRAIN data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.168153 | 0.128137 | 0.927406 | 0.921356 | 6.322593 |
print('Performance on ' + bold_text('TEST') + ' data: ')
lin_reg_test = model_performance_regression(olsmodel1, X_test1, y_test)
lin_reg_test
Performance on TEST data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.195618 | 0.137312 | 0.902631 | 0.859047 | 6.648424 |
TRAIN vs. TEST Observations:
From train to test, the errors (RMSE, MAE & MAPE) all increased slightly,
while the fit (R-sq & Adj. R-sq) decreased slightly (which makes sense, since fit and error generally move inversely).
Although the difference is not drastic, it could suggest mild overfitting, so let's examine the OLS summary results.
from statsmodels.stats.outliers_influence import variance_inflation_factor
def checking_vif(train):
vif = pd.DataFrame()
vif["feature"] = train.columns
# Calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(train.values, i) for i in range(len(train.columns))
]
return vif
print(checking_vif(X_train1))
   feature                 VIF
0  const              0.000000
1  Year               2.067078
2  Kilometers_Driven  1.963702
3  Mileage            6.390740
4  Power              7.204895
5  Seats              3.239493
6-46  all one-hot dummy columns (Location_*, Fuel_Type_*,
      Transmission_*, Owner_Type_*, Manufacturer_*)    inf
VIF Observations
Two numeric variables have VIF scores > 5 — in descending order: Power (7.20) and Mileage (6.39).
But both are valuable factors to consider when pricing & buying a used car, so dropping either does not fit the context. (The inf values on the dummy columns are expected: every level of each categorical was one-hot encoded, so the dummy groups are perfectly collinear.) We'll try regularization methods to address this, though it's unlikely to help much since the data lacks the traits that benefit most from them. Let's see...
# Run Ridge on most recent data (one round with constant)
ridge_model = Ridge()
ridge_model.fit(X_train1, y_train)
Ridge()
# Running it on training data w/ default alpha=1
ridge_reg = model_performance_regression(ridge_model, X_train1, y_train)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.169035 | 0.129223 | 0.926643 | 0.92053 | 6.379982 |
# On test data
ridge_reg = model_performance_regression(ridge_model, X_test1, y_test)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.196574 | 0.13778 | 0.901678 | 0.857667 | 6.655618 |
L2 / Ridge Observations:
The errors (RMSE, MAE & MAPE) all had slight increases in both the train & test data.
Unsurprisingly, the fit for both sets (R-sq & Adj R-sq) had minor dips.
We'll see if L1 / Lasso has different results.
# Run Lasso on most recent data (one round with constant)
lasso_model = Lasso()
lasso_model.fit(X_train1, y_train)
Lasso()
# Running it on training data w/ default alpha=1
lasso_reg = model_performance_regression(lasso_model, X_train1, y_train)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.327777 | 0.245776 | 0.724166 | 0.70118 | 12.445443 |
# On test data
lasso_reg = model_performance_regression(lasso_model, X_test1, y_test)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.312122 | 0.241016 | 0.752115 | 0.641157 | 11.885197 |
L1 / Lasso Observations:
The TRAINING errors nearly doubled, and the R-squared fits dropped by more than 20 percentage points relative to Ridge.
These substantially lower fits suggest that Lasso (at least at the default alpha = 1) is not a good model for this data.
Possible reasons for this could be: the default penalty is strong enough to zero out many of the small dummy-variable coefficients, and the unscaled features mean a uniform L1 penalty hits some coefficients much harder than others.
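One concrete hypothesis worth testing: the features are unscaled (Year is ~2015 while dummies are 0/1), so a uniform L1 penalty affects coefficients very unevenly. Standardizing before Lasso, e.g. via a pipeline, puts them on a comparable footing. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data mixing a large-scale column with a 0/1 column, purely illustrative
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.normal(2015.0, 3.0, 300),            # a Year-like column on a large scale
    rng.integers(0, 2, 300).astype(float),   # a dummy-like 0/1 column
])
y_demo = 0.1 * (X_demo[:, 0] - 2015.0) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Scale first, then apply the L1 penalty on comparable coefficient magnitudes
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
scaled_lasso.fit(X_demo, y_demo)
print(round(scaled_lasso.score(X_demo, y_demo), 3))
```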
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Ridge()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train1, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Ridge(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.5
# First fit only to Training Data
ridge_model_tuned = Ridge(alpha = best_alpha) # creating Tuned Ridge Regression model using optimised alpha value
ridge_model_tuned.fit(X_train1, y_train) # Fitting the data into the tuned model
Ridge(alpha=0.5)
# Check performance on training data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_train1, y_train) # performance metrics on training data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.168483 | 0.128628 | 0.927121 | 0.921047 | 6.350315 |
# Performance check on test data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_test1, y_test) # performance metrics on test data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.196089 | 0.137372 | 0.902162 | 0.858368 | 6.636593 |
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Lasso()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train1, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Lasso(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.001
# First fit only to Training Data
lasso_model_tuned = Lasso(alpha = best_alpha) # creating Tuned Lasso Regression model using optimized alpha value
lasso_model_tuned.fit(X_train1, y_train) # Fitting the data into the tuned model
Lasso(alpha=0.001)
# Check performance on training data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_train1, y_train) #getting performance metrics on training data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.172044 | 0.131336 | 0.924008 | 0.917675 | 6.507852 |
# Performance check on test data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_test1, y_test) #getting performance metrics on test data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.200495 | 0.142307 | 0.897716 | 0.851932 | 6.820849 |
As expected, although tuning improved Lasso's own results, it still did not outperform the original model.
On to other methods...
This likely won't help either, since ElasticNet only blends the L1 and L2 penalties we've already tried, and at the default alpha its L1 component produces the same over-shrinkage that Lasso showed.
elasticnet_model = ElasticNet()
elasticnet_model.fit(X_train1, y_train)
ElasticNet()
elasticnet_reg = model_performance_regression(elasticnet_model, X_train1, y_train)
elasticnet_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.327097 | 0.244417 | 0.72531 | 0.702419 | 12.360994 |
elasticnet_reg = model_performance_regression(elasticnet_model, X_test1, y_test)
elasticnet_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.312763 | 0.240042 | 0.751096 | 0.639682 | 11.820109 |
Now let's try tuning the Elastic Net.
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]}
model = ElasticNet()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train1, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=ElasticNet(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
                         'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimized parameters for alpha
best_params
{'alpha': 0.001, 'l1_ratio': 0.05}
best_alpha = best_params['alpha']
best_ratio = best_params['l1_ratio']
elasticnet_model_tuned = ElasticNet(alpha = best_alpha, l1_ratio = best_ratio)
elasticnet_model_tuned.fit(X_train1, y_train)
ElasticNet(alpha=0.001, l1_ratio=0.05)
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_train1, y_train)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.168718 | 0.128935 | 0.926917 | 0.920827 | 6.366589 |
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_test1, y_test)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.196586 | 0.137997 | 0.901666 | 0.85765 | 6.658785 |
models= pd.concat([lin_reg_test,
ridge_reg, ridge_reg_tuned,
lasso_reg, lasso_reg_tuned,
elasticnet_reg, elasticnet_reg_tuned],
axis=0
) #combining all models into a single dataframe
models['Models'] = ['Least Squares', 'Ridge Regression', 'Ridge Regression Tuned', 'Lasso Regression',
'Lasso Regression Tuned', 'Elastic Net Regression',
'Elastic Net Regression Tuned'] #adding names of the models as a column to the dataframe
models = models.iloc[:,[5, 0,1,2,3,4]] #ordering names of the models as the first column
models
| | Models | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|---|
| 0 | Least Squares | 0.195618 | 0.137312 | 0.902631 | 0.859047 | 6.648424 |
| 0 | Ridge Regression | 0.196574 | 0.137780 | 0.901678 | 0.857667 | 6.655618 |
| 0 | Ridge Regression Tuned | 0.196089 | 0.137372 | 0.902162 | 0.858368 | 6.636593 |
| 0 | Lasso Regression | 0.312122 | 0.241016 | 0.752115 | 0.641157 | 11.885197 |
| 0 | Lasso Regression Tuned | 0.200495 | 0.142307 | 0.897716 | 0.851932 | 6.820849 |
| 0 | Elastic Net Regression | 0.312763 | 0.240042 | 0.751096 | 0.639682 | 11.820109 |
| 0 | Elastic Net Regression Tuned | 0.196586 | 0.137997 | 0.901666 | 0.857650 | 6.658785 |
There are two main points to keep in mind when interpreting these results:
The target price was log-transformed, so that needs to be reversed in order to interpret the RMSE, MAE & R-sq in price terms... but NOT the Adj. R-sq or the MAPE.
In particular, the MAPE is already a percentage, so it translates directly to the actual target prices. Here, the best model (OLS) is off by about 6.65% from actual prices, and the worst (untuned Lasso) by about 11.89%.
# Apply the exponential function to errors individually (for clarity in comparison)
import math

# Training Data
rmse1 = math.exp(lin_reg_train['RMSE'])
mae1 = math.exp(lin_reg_train['MAE'])
# Test Data
rmse2 = math.exp(lin_reg_test['RMSE'])
mae2 = math.exp(lin_reg_test['MAE'])
# Print the results
print(bold_text("TRAIN:"))
print("RMSE:", round(rmse1, 3))
print("MAE:", round(mae1, 3))
print('-' * 12) # print separator line
print(bold_text("TEST:"))
print("RMSE:", round(rmse2, 3))
print("MAE:", round(mae2, 3))
TRAIN:
RMSE: 1.183
MAE: 1.137
------------
TEST:
RMSE: 1.216
MAE: 1.147
# Convert the above to INR & USD
# Training INR & USD
inr_error1 = rmse1 * 100000 # converting based on the price variable definition
inr_error2 = mae1 * 100000
usd_error1 = inr_error1 * 0.012 # use currency conversion of 1 INR = 0.012 USD
usd_error2 = inr_error2 * 0.012
# Test INR & USD
inr_error3 = rmse2 * 100000
inr_error4 = mae2 * 100000
usd_error3 = inr_error3 * 0.012
usd_error4 = inr_error4 * 0.012
##### PRINT the converted results
print(bold_text("TRAIN:"))
print("RMSE (INR): ₹", round(inr_error1, 2)) # rounding to 2 for money
print("MAE (INR): ₹", round(inr_error2, 2))
print('-' * 10)
print("RMSE (USD): $", round(usd_error1, 2))
print("MAE (USD): $", round(usd_error2, 2))
print('\n', '=' * 30, '\n') # print separator line
print(bold_text("TEST:"))
print("RMSE (INR): ₹", round(inr_error3, 2))
print("MAE (INR): ₹", round(inr_error4, 2))
print('-' * 10)
print("RMSE (USD): $", round(usd_error3, 2))
print("MAE (USD): $", round(usd_error4, 2))
TRAIN:
RMSE (INR): ₹ 118311.79
MAE (INR): ₹ 113670.88
----------
RMSE (USD): $ 1419.74
MAE (USD): $ 1364.05

==============================

TEST:
RMSE (INR): ₹ 121606.25
MAE (INR): ₹ 114718.61
----------
RMSE (USD): $ 1459.27
MAE (USD): $ 1376.62
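One caveat worth flagging: exponentiating a log-scale RMSE gives a typical *multiplicative* error factor (e.g. "off by ~22%"), not an additive rupee amount, so the INR/USD figures above are rough approximations. A minimal sketch with made-up log-scale values (not the notebook's data) showing the alternative of back-transforming the predictions first and measuring the error directly in price units:

```python
import numpy as np

# Hypothetical log-scale actuals and predictions (stand-ins, NOT the notebook's data)
y_log_true = np.array([2.0, 2.5, 3.0, 1.5])
y_log_pred = np.array([2.1, 2.4, 3.2, 1.4])

# Shortcut used above: exponentiate the log-scale RMSE
rmse_log = np.sqrt(np.mean((y_log_true - y_log_pred) ** 2))
factor = np.exp(rmse_log)  # a multiplicative error factor, here ~1.14 ("off by ~14%")

# Direct route: back-transform predictions first, then measure in price units
rmse_price = np.sqrt(np.mean((np.exp(y_log_true) - np.exp(y_log_pred)) ** 2))
```

Both views are useful; the direct route just answers "how many rupees off, on average" without the approximation.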
Error Conversion Observation (pt. 1 - raw money)
# INR mean price
inr_mean_price = data['Price'].mean() * 100000
inr_mean_price_formatted = '₹{:,.2f}'.format(inr_mean_price)
print('Average ', bold_text('INR Price: '), inr_mean_price_formatted)
# USD mean price
usd_mean_price = inr_mean_price * 0.012
usd_mean_price_formatted = '${:,.2f}'.format(usd_mean_price)
print('Average ', bold_text('USD Price: '), usd_mean_price_formatted)
Average INR Price: ₹1,058,015.69
Average USD Price: $12,696.19
# Percentage of error off mean
inr_mean_error_percent = 120000 / inr_mean_price * 100 # Just using 120,000 as a ballpark based on the above results
inr_mean_error_percent_rounded = round(inr_mean_error_percent, 2)
usd_mean_error_percent = 1400 / usd_mean_price * 100 # Similarly with 1,400
usd_mean_error_percent_rounded = round(usd_mean_error_percent, 2)
print('INR Error is about', bold_text(str(inr_mean_error_percent_rounded)+'%'), 'off the mean')
print('USD Error is about', bold_text(str(usd_mean_error_percent_rounded)+'%'), 'off the mean')
INR Error is about 11.34% off the mean
USD Error is about 11.03% off the mean
Error Conversion Observation (pt. 2 - percentage of the mean)
This works out to only about 11% off the mean price in both currencies (the small difference between them is due to manual rounding).
It might seem redundant to compute the error as a percentage of the mean when we already have the MAPE, but MAPE averages the percentage error over individual predicted/actual pairs; I was simply curious how the error compared to the overall mean price.
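The ballpark constants (120,000 / 1,400) can be avoided by taking the ratio directly; a small sketch using the MAE and mean price printed above (the ratio is currency-free, so INR and USD necessarily agree):

```python
# Values copied from the printouts above (INR)
mae_inr = 114718.61          # test MAE after exponentiating
mean_price_inr = 1058015.69  # average price

mae_pct_of_mean = mae_inr / mean_price_inr * 100
print(f"MAE is {mae_pct_of_mean:.2f}% of the mean price")  # about 10.84%
```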
Models Comparisons:
As we can see from the comparison summary, Least Squares performed the best (lowest errors & best fits).
In order to interpret the results properly, we'll perform a "math.exp" on the first three metrics to reverse the log transformation.
data_log = data.copy() # making a copy so I can return to "data" if this log transformation doesn't help
data_log['km_log'] = np.log(data_log['Kilometers_Driven'] + 1)
# Split the data BUT with an extra "_log" to differentiate between our previous model
X_log = data_log.drop(['Price', 'Engine', 'Kilometers_Driven', 'price_log'], axis=1) # 'Kilometers_Driven' => 'km_log'
y = data_log['price_log'] # y doesn't need to change since it's still the same dependent variable
# Encode categorical variables
X_log = pd.get_dummies(X_log, columns=['Location','Fuel_Type','Transmission','Owner_Type', 'Manufacturer'])
# Split the data into train and test
X_log_train, X_log_test, y_train, y_test = train_test_split(
X_log, y,
test_size = 0.20, # sets 20% for test data & leaving 80% for training data
shuffle = True, # reorders the list when taking random samples
random_state = 1 # sets a seed to reproduce results while comparing different model runs
)
import statsmodels.api as sm
# New 'X_train2' with the new 'km_log'
X_train2 = sm.add_constant(X_log_train)
# Cast the boolean array to np.float64
X_train2 = X_train2.astype(np.float64)
# Add constant to test data
X_test2 = sm.add_constant(X_log_test)
# Train the model
olsmodel2 = sm.OLS(y_train, X_train2).fit()
# Get the model summary
olsmodel2.summary()
print(olsmodel2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.925
Model: OLS Adj. R-squared: 0.920
Method: Least Squares F-statistic: 172.4
Date: Fri, 12 Apr 2024 Prob (F-statistic): 3.71e-292
Time: 04:14:14 Log-Likelihood: 214.24
No. Observations: 612 AIC: -344.5
Df Residuals: 570 BIC: -159.0
Df Model: 41
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -84.7045 3.651 -23.201 0.000 -91.875 -77.534
Year 0.0979 0.004 23.620 0.000 0.090 0.106
Mileage -0.0170 0.004 -4.013 0.000 -0.025 -0.009
Power 0.0087 0.000 18.687 0.000 0.008 0.010
Seats -0.0505 0.017 -3.052 0.002 -0.083 -0.018
km_log -0.0412 0.011 -3.819 0.000 -0.062 -0.020
Location_Ahmedabad -7.6371 0.331 -23.070 0.000 -8.287 -6.987
Location_Bangalore -7.6038 0.333 -22.834 0.000 -8.258 -6.950
Location_Chennai -7.6905 0.330 -23.301 0.000 -8.339 -7.042
Location_Coimbatore -7.6308 0.337 -22.648 0.000 -8.293 -6.969
Location_Delhi -7.7508 0.334 -23.222 0.000 -8.406 -7.095
Location_Hyderabad -7.6450 0.331 -23.086 0.000 -8.295 -6.995
Location_Jaipur -7.6978 0.334 -23.042 0.000 -8.354 -7.042
Location_Kochi -7.7216 0.338 -22.858 0.000 -8.385 -7.058
Location_Kolkata -7.8751 0.331 -23.807 0.000 -8.525 -7.225
Location_Mumbai -7.7150 0.330 -23.381 0.000 -8.363 -7.067
Location_Pune -7.7372 0.334 -23.157 0.000 -8.393 -7.081
Fuel_Type_CNG -28.2037 1.218 -23.155 0.000 -30.596 -25.811
Fuel_Type_Diesel -28.1017 1.220 -23.037 0.000 -30.498 -25.706
Fuel_Type_Petrol -28.3991 1.215 -23.381 0.000 -30.785 -26.013
Transmission_Automatic -42.3226 1.827 -23.166 0.000 -45.911 -38.734
Transmission_Manual -42.3819 1.824 -23.236 0.000 -45.964 -38.799
Owner_Type_First -28.2431 1.229 -22.976 0.000 -30.657 -25.829
Owner_Type_Second -28.2537 1.225 -23.063 0.000 -30.660 -25.847
Owner_Type_Third -28.2078 1.200 -23.507 0.000 -30.565 -25.851
Manufacturer_Audi -3.5996 0.172 -20.943 0.000 -3.937 -3.262
Manufacturer_BMW -3.6562 0.168 -21.772 0.000 -3.986 -3.326
Manufacturer_Datsun -4.1035 0.187 -21.998 0.000 -4.470 -3.737
Manufacturer_Fiat -4.3566 0.239 -18.224 0.000 -4.826 -3.887
Manufacturer_Ford -3.9134 0.173 -22.559 0.000 -4.254 -3.573
Manufacturer_Honda -3.8399 0.169 -22.655 0.000 -4.173 -3.507
Manufacturer_Hyundai -3.9314 0.170 -23.165 0.000 -4.265 -3.598
Manufacturer_Isuzu -3.9287 0.240 -16.348 0.000 -4.401 -3.457
Manufacturer_Jaguar -3.7282 0.209 -17.838 0.000 -4.139 -3.318
Manufacturer_Jeep -3.9787 0.181 -22.027 0.000 -4.333 -3.624
Manufacturer_Mahindra -3.9666 0.172 -23.084 0.000 -4.304 -3.629
Manufacturer_Maruti -3.8438 0.170 -22.611 0.000 -4.178 -3.510
Manufacturer_Mercedes-Benz -3.4655 0.176 -19.665 0.000 -3.812 -3.119
Manufacturer_Mini -3.1540 0.173 -18.193 0.000 -3.494 -2.813
Manufacturer_Mitsubishi -3.8961 0.185 -21.068 0.000 -4.259 -3.533
Manufacturer_Nissan -3.9257 0.173 -22.686 0.000 -4.266 -3.586
Manufacturer_Renault -4.0011 0.173 -23.110 0.000 -4.341 -3.661
Manufacturer_Skoda -3.9737 0.173 -22.913 0.000 -4.314 -3.633
Manufacturer_Tata -4.0957 0.174 -23.524 0.000 -4.438 -3.754
Manufacturer_Toyota -3.7858 0.169 -22.465 0.000 -4.117 -3.455
Manufacturer_Volkswagen -3.9942 0.169 -23.673 0.000 -4.326 -3.663
Manufacturer_Volvo -3.5662 0.246 -14.494 0.000 -4.049 -3.083
==============================================================================
Omnibus: 37.927 Durbin-Watson: 1.887
Prob(Omnibus): 0.000 Jarque-Bera (JB): 100.909
Skew: -0.276 Prob(JB): 1.22e-22
Kurtosis: 4.911 Cond. No. 1.00e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.49e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
print('Performance on ' + bold_text('TRAIN') + ' data: ')
lin_reg_train = model_performance_regression(olsmodel2, X_train2, y_train)
lin_reg_train
Performance on TRAIN data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.170502 | 0.130416 | 0.925364 | 0.919144 | 6.471896 |
print('Performance on ' + bold_text('TEST') + ' data: ')
lin_reg_test = model_performance_regression(olsmodel2, X_test2, y_test)
lin_reg_test
Performance on TEST data:
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.198575 | 0.136468 | 0.899666 | 0.854754 | 6.635891 |
from statsmodels.stats.outliers_influence import variance_inflation_factor
def checking_vif(train):
vif = pd.DataFrame()
vif["feature"] = train.columns
# Calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(train.values, i) for i in range(len(train.columns))
]
return vif
print(checking_vif(X_train2))
feature                          VIF
const                       0.000000
Year                        1.961651
Mileage                     6.430930
Power                       7.203300
Seats                       3.236457
km_log                      1.779721
Location_* (11 dummies)          inf
Fuel_Type_* (3 dummies)          inf
Transmission_* (2 dummies)       inf
Owner_Type_* (3 dummies)         inf
Manufacturer_* (22 dummies)      inf
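The infinite VIFs are the dummy-variable trap: for each categorical, the full set of one-hot columns sums to exactly 1 (the constant), so the design matrix is perfectly collinear. One common remedy, shown here on a toy frame rather than as a change to the notebook's pipeline, is to drop one level per categorical:

```python
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({
    "Fuel": ["Petrol", "Diesel", "Petrol", "CNG"],
    "Power": [80.0, 120.0, 95.0, 60.0],
})

full = pd.get_dummies(df, columns=["Fuel"])                      # k dummies per categorical
reduced = pd.get_dummies(df, columns=["Fuel"], drop_first=True)  # k-1 dummies

# With all k dummies, each row's dummies sum to 1 -> perfectly collinear with const
dummy_sums = full[["Fuel_CNG", "Fuel_Diesel", "Fuel_Petrol"]].sum(axis=1)
```

With `drop_first=True` the dropped level becomes the baseline absorbed into the intercept, and the VIFs for the remaining dummies become finite.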
# Run Ridge on most recent data (3rd round)
ridge_model = Ridge()
ridge_model.fit(X_train2, y_train)
Ridge()
# Running it on training data w/ default alpha=1
ridge_reg = model_performance_regression(ridge_model, X_train2, y_train)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.171429 | 0.131601 | 0.92455 | 0.918262 | 6.533418 |
# On test data
ridge_reg = model_performance_regression(ridge_model, X_test2, y_test)
ridge_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.199574 | 0.1369 | 0.898654 | 0.85329 | 6.633909 |
# Run Lasso on most recent data (3rd round)
lasso_model = Lasso()
lasso_model.fit(X_train2, y_train)
Lasso()
# Running it on training data w/ default alpha=1
lasso_reg = model_performance_regression(lasso_model, X_train2, y_train)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.358635 | 0.260463 | 0.669786 | 0.642268 | 13.343935 |
# On test data
lasso_reg = model_performance_regression(lasso_model, X_test2, y_test)
lasso_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.350243 | 0.269687 | 0.687868 | 0.548152 | 13.09274 |
elasticnet_model = ElasticNet()
elasticnet_model.fit(X_train2, y_train)
ElasticNet()
elasticnet_reg = model_performance_regression(elasticnet_model, X_test2, y_test)
elasticnet_reg
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.329328 | 0.249903 | 0.724032 | 0.600504 | 12.048862 |
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Ridge()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train2, y_train) # tune on X_train2 (the km_log split) to match the fit below; the original cell mistakenly used X_train1
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Ridge(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.5
# First fit only to Training Data
ridge_model_tuned = Ridge(alpha = best_alpha) # creating Tuned Ridge Regression model using optimised alpha value
ridge_model_tuned.fit(X_train2, y_train) # Fitting the data into the tuned model
Ridge(alpha=0.5)
# Performance check on training data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_train2, y_train) # performance metrics on training data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.170853 | 0.131028 | 0.925056 | 0.918811 | 6.505385 |
# Performance check on test data
ridge_reg_tuned = model_performance_regression(ridge_model_tuned, X_test2, y_test) # performance metrics on test data
ridge_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.199065 | 0.136567 | 0.89917 | 0.854037 | 6.621545 |
# Implementing KFold with 10 subsets of the most recent training data
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]} # trying different alpha values
model = Lasso()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train2, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=Lasso(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9, 1, 5, 10, 20]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimised parameters for alpha
best_alpha = best_params['alpha']
best_alpha
0.001
# First fit only to Training Data
lasso_model_tuned = Lasso(alpha = best_alpha) # creating Tuned Lasso Regression model using optimized alpha value
lasso_model_tuned.fit(X_train2, y_train) # Fitting the data into the tuned model
Lasso(alpha=0.001)
# Check performance on training data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_train2, y_train) #getting performance metrics on training data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.174365 | 0.133523 | 0.921943 | 0.915438 | 6.639703 |
# Performance check on test data
lasso_reg_tuned = model_performance_regression(lasso_model_tuned, X_test2, y_test) #getting performance metrics on test data
lasso_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.204342 | 0.141808 | 0.893753 | 0.846195 | 6.812168 |
elasticnet_model = ElasticNet()
elasticnet_model.fit(X_train2, y_train)
ElasticNet()
folds = KFold(n_splits=10, shuffle=True, random_state=1)
params = {'alpha':[0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]}
model = ElasticNet()
model_cv = GridSearchCV(estimator=model, param_grid=params, scoring='r2', cv=folds, return_train_score=True)
model_cv.fit(X_train2, y_train)
GridSearchCV(cv=KFold(n_splits=10, random_state=1, shuffle=True),
             estimator=ElasticNet(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 0.2, 0.5, 0.9],
                         'l1_ratio': [0.001, 0.01, 0.02, 0.03, 0.04, 0.05]},
             return_train_score=True, scoring='r2')
# Working towards a variable 'best_alpha'
best_params = model_cv.best_params_ #getting optimized parameters for alpha
best_params
{'alpha': 0.001, 'l1_ratio': 0.05}
best_alpha = best_params['alpha']
best_ratio = best_params['l1_ratio']
elasticnet_model_tuned = ElasticNet(alpha = best_alpha, l1_ratio = best_ratio)
elasticnet_model_tuned.fit(X_train2, y_train)
ElasticNet(alpha=0.001, l1_ratio=0.05)
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_train2, y_train)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.171089 | 0.131275 | 0.924849 | 0.918586 | 6.518294 |
elasticnet_reg_tuned = model_performance_regression(elasticnet_model_tuned, X_test2, y_test)
elasticnet_reg_tuned
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.199576 | 0.137137 | 0.898652 | 0.853286 | 6.640268 |
models= pd.concat([lin_reg_test,
ridge_reg, ridge_reg_tuned,
lasso_reg, lasso_reg_tuned,
elasticnet_reg, elasticnet_reg_tuned],
axis=0
) #combining all models into a single dataframe
models['Models'] = ['Least Squares', 'Ridge Regression', 'Ridge Regression Tuned', 'Lasso Regression',
'Lasso Regression Tuned', 'Elastic Net Regression',
'Elastic Net Regression Tuned'] #adding names of the models as a column to the dataframe
models = models.iloc[:,[5, 0,1,2,3,4]] #ordering names of the models as the first column
models
| | Models | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|---|
| 0 | Least Squares | 0.198575 | 0.136468 | 0.899666 | 0.854754 | 6.635891 |
| 0 | Ridge Regression | 0.199574 | 0.136900 | 0.898654 | 0.853290 | 6.633909 |
| 0 | Ridge Regression Tuned | 0.199065 | 0.136567 | 0.899170 | 0.854037 | 6.621545 |
| 0 | Lasso Regression | 0.350243 | 0.269687 | 0.687868 | 0.548152 | 13.092740 |
| 0 | Lasso Regression Tuned | 0.204342 | 0.141808 | 0.893753 | 0.846195 | 6.812168 |
| 0 | Elastic Net Regression | 0.329328 | 0.249903 | 0.724032 | 0.600504 | 12.048862 |
| 0 | Elastic Net Regression Tuned | 0.199576 | 0.137137 | 0.898652 | 0.853286 | 6.640268 |
There was no improvement in any model from log-transforming 'Kilometers_Driven' (the transformation actually performed slightly worse), and there is no contextual or mathematical reason to drop the feature entirely. So I will use the original feature, not the logged version, moving forward, since the original OLS has the best metrics thus far.
It is interesting that (except for default Lasso & Elastic Net) the MAPE scores improved slightly (i.e. decreased) with 'km_log'. But the differences are only about 0.01 percentage points, and the log transformation drastically changes the feature's scale, so I still would not opt for 'km_log'.
# Using original OLS model w/ "X_train1" since that yielded the best performance metrics
y_train_predicted = olsmodel1.predict(X_train1)
y_test_predicted = olsmodel1.predict(X_test1)
y_pred = np.exp(y_train_predicted) # back-transform the log-scale predictions
y_actual = data['Price']
# Residuals
residuals_train = y_train - y_train_predicted
residuals_test = y_test - y_test_predicted
residuals_train.mean()
1.6738533217438414e-12
residuals_test.mean()
-0.019367933955397963
Both residual means are very close to 0 (especially for the training data).
Residuals = 0 Check!
# Plot histogram of residuals
sns.histplot(residuals_train, kde=True);
# Plot histogram of residuals
sns.histplot(residuals_test, kde=True);
Plots look quite normally distributed (except a slight blip with the test data off to the side).
Error Normality = Check!
# Predicted values
fitted = olsmodel1.predict(X_train1)
# Plotting Residuals VS Fitted Values
sns.residplot(x = fitted, y = residuals_train, color="lightblue")
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
# Predicted values
fitted = olsmodel1.predict(X_test1)
# Plotting Residuals VS Fitted Values
sns.residplot(x = fitted, y = residuals_test, color="lightblue")
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals_train, X_train1)
lzip(name, test)
[('F statistic', 1.3210813330703832), ('p-value', 0.011742577708054472)]
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals_test, X_test1)
lzip(name, test)
[('F statistic', 0.5453798908480055), ('p-value', 0.9709654282129023)]
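For interpretation: the training p-value (~0.012) rejects homoscedasticity at the 5% level, while the test split (p ≈ 0.97) does not, so at most mild heteroscedasticity is present. The intuition behind Goldfeld-Quandt can be sketched with plain NumPy on synthetic residuals (not the model's): order the observations, then compare residual variance across the two halves.

```python
import numpy as np

rng = np.random.default_rng(1)
resid = rng.normal(0.0, 1.0, 200)  # homoscedastic by construction

# Goldfeld-Quandt idea: after ordering observations (e.g. by a fitted value),
# the variance ratio of the upper vs lower half should be near 1
# under homoscedasticity, and far from 1 if the spread grows with the ordering
half = len(resid) // 2
f_stat = resid[half:].var(ddof=1) / resid[:half].var(ddof=1)
```

The statsmodels version additionally supplies the p-value from the F distribution of this ratio.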
KILOMETERS vs. KM_LOG
WITH vs. WITHOUT Constant
The only differences arise in the Adj. R-sq without a constant, which is probably because it penalizes models with more features.
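That penalty can be checked directly from the Adjusted R-sq formula (n = observations, k = predictors); plugging in the OLS summary's numbers above (R² ≈ 0.925, n = 612, k = 41) recovers its reported 0.920:

```python
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_r2(0.925, 612, 41)  # ~0.9196, matching the summary's Adj. R-squared of 0.920
```

Dropping the constant changes k, which is why only the Adjusted R-sq moves.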
Back to using X_train (without the constant that was added to X_train1 for the linear regression models above), since the upcoming models with hyperparameter tuning are tree-based rather than linear.
Also using the regular 'Kilometers_Driven' instead of 'km_log', since the former generally performed better.
# Start fitting a Decision Tree regressor to our 'X_train' data
dt_regressor = DecisionTreeRegressor(random_state = 1) # set the random state for reproducibility
dt_regressor.fit(X_train, y_train)
DecisionTreeRegressor(random_state=1)
dt_regressor_perf_test = model_performance_regression(dt_regressor, X_test, y_test)
dt_regressor_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.244795 | 0.162963 | 0.847523 | 0.781353 | 8.129842 |
from sklearn import tree
features = list(X.columns)
print(len(features))
46
Rendering the full, unconstrained tree grown over 46 features would be far too deep and wide to read -> not practical.
So let's train a copy specifically for visualization (using a maximum depth).
dt_regressor_visualize = DecisionTreeRegressor(random_state = 1, max_depth = 3) # three levels seems reasonable to visualize
dt_regressor_visualize.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=3, random_state=1)
plt.figure(figsize=(20, 10))
tree.plot_tree(dt_regressor_visualize, feature_names = features,
filled = True, fontsize = 10.5, node_ids = True, class_names = True); # optional for easy-reading
Why is the 'Seats' split threshold so large? Let's check the range to make sure the column didn't get corrupted:
# Check range of seats
seats = data['Seats'].unique()
print('Range of Seats:', seats.min(), 'to', seats.max())
Range of Seats: 4.0 to 8.0
Feature Importance
Based on the top levels, the model identified 'Seats' as the most informative feature. However, the split threshold of 131.1 is clearly nonsense for a column that ranges from 4 to 8. I tried converting it to an int (after removing outliers/blanks), since it was originally stored as a float (because of the missing values), but that didn't help.
The purest nodes reside on the right part of the tree:
bagging_estimator = BaggingRegressor(random_state = 1)
bagging_estimator.fit(X_train, y_train)
BaggingRegressor(random_state=1)
bagging_estimator_perf_test = model_performance_regression(bagging_estimator, X_test, y_test)
bagging_estimator_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.203924 | 0.13852 | 0.894188 | 0.848269 | 6.608667 |
Bagging Observations:
As expected, bagging significantly improved on the single Decision Tree's metrics (lower errors & better fits).
As with the single Decision Tree, though, the MAPE is the only score that is actually a bit better than the original OLS (the other metrics are still not as good).
Let's see how Random Forest performs...
rf_regressor = RandomForestRegressor(random_state = 1)
rf_regressor.fit(X_train, y_train)
RandomForestRegressor(random_state=1)
rf_regressor_perf_test = model_performance_regression(rf_regressor, X_test, y_test)
rf_regressor_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.186186 | 0.126961 | 0.911795 | 0.873517 | 6.141731 |
Random Forest now gives the best performance. It finally surpasses the OLS metrics (lower errors & better fits).
from sklearn.ensemble import AdaBoostRegressor
ada_regressor = AdaBoostRegressor(random_state = 1)
ada_regressor.fit(X_train, y_train)
AdaBoostRegressor(random_state=1)
ada_regressor_perf_test = model_performance_regression(ada_regressor, X_test, y_test)
ada_regressor_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.248308 | 0.1967 | 0.843114 | 0.775032 | 9.375605 |
No help with AdaBoost
from sklearn.ensemble import GradientBoostingRegressor
grad_regressor = GradientBoostingRegressor(random_state=1)
grad_regressor.fit(X_train, y_train)
GradientBoostingRegressor(random_state=1)
grad_regressor_perf_test = model_performance_regression(grad_regressor, X_test, y_test)
grad_regressor_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.180605 | 0.121228 | 0.917003 | 0.880986 | 5.789599 |
Gradient Boosting helps even further: we now have lower errors and better fits than Random Forest. This is the BEST model so far.
Next up is XGBoost, which:
Introduces L1 & L2 regularization
Prunes trees
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (2.0.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.25.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.11.4)
from xgboost import XGBRegressor
xgb_regressor = XGBRegressor(random_state=1)
xgb_regressor.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=1, ...)
xgb_regressor_perf_test = model_performance_regression(xgb_regressor, X_test, y_test)
xgb_regressor_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.197076 | 0.127968 | 0.901175 | 0.858288 | 6.108154 |
XGBoost Results
Compared to Gradient Boosting, XGBoost is worse on every metric here: higher RMSE, MAE & MAPE, and lower R-squared & Adj. R-squared.
(It does edge out Random Forest on MAPE, 6.108 vs. 6.142, but nothing else.)
So not an improvement over Gradient Boosting.
models_test_comp_df = pd.concat(
[
dt_regressor_perf_test.T, # Transpose so each model = column (instead of a row)
bagging_estimator_perf_test.T,
rf_regressor_perf_test.T,
ada_regressor_perf_test.T,
grad_regressor_perf_test.T,
xgb_regressor_perf_test.T
],
axis = 1 # for columns
)
models_test_comp_df.columns = [ # manually setting the column names
'Decision Tree',
'Bagging',
'Random Forest',
'AdaBoost',
'Gradient Boost',
'XG Boost'
]
models_test_comp_df
| | Decision Tree | Bagging | Random Forest | AdaBoost | Gradient Boost | XG Boost |
|---|---|---|---|---|---|---|
| RMSE | 0.244795 | 0.203924 | 0.186186 | 0.248308 | 0.180605 | 0.197076 |
| MAE | 0.162963 | 0.138520 | 0.126961 | 0.196700 | 0.121228 | 0.127968 |
| R-squared | 0.847523 | 0.894188 | 0.911795 | 0.843114 | 0.917003 | 0.901175 |
| Adj. R-squared | 0.781353 | 0.848269 | 0.873517 | 0.775032 | 0.880986 | 0.858288 |
| MAPE | 8.129842 | 6.608667 | 6.141731 | 9.375605 | 5.789599 | 6.108154 |
models_test_comp_df = pd.concat(
[
dt_regressor_perf_test,
bagging_estimator_perf_test,
rf_regressor_perf_test,
ada_regressor_perf_test,
grad_regressor_perf_test,
xgb_regressor_perf_test
]
)
models_test_comp_df.index = [ # manually setting the row names
'Decision Tree',
'Bagging',
'Random Forest',
'AdaBoost',
'Gradient Boost',
'XG Boost'
]
models_test_comp_df
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| Decision Tree | 0.244795 | 0.162963 | 0.847523 | 0.781353 | 8.129842 |
| Bagging | 0.203924 | 0.138520 | 0.894188 | 0.848269 | 6.608667 |
| Random Forest | 0.186186 | 0.126961 | 0.911795 | 0.873517 | 6.141731 |
| AdaBoost | 0.248308 | 0.196700 | 0.843114 | 0.775032 | 9.375605 |
| Gradient Boost | 0.180605 | 0.121228 | 0.917003 | 0.880986 | 5.789599 |
| XG Boost | 0.197076 | 0.127968 | 0.901175 | 0.858288 | 6.108154 |
Six Model Summary
Best Performing: Random Forest & Gradient Boost
Worst Performing: Decision Tree & AdaBoost
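The best/worst calls above can also be read off programmatically. A self-contained sketch that rebuilds three columns from the table and asks pandas which model wins each metric (lower is better for errors, higher for fit):

```python
import pandas as pd

# Rebuild part of the comparison table above (models as rows)
comp = pd.DataFrame(
    {
        "RMSE": [0.244795, 0.203924, 0.186186, 0.248308, 0.180605, 0.197076],
        "MAE": [0.162963, 0.138520, 0.126961, 0.196700, 0.121228, 0.127968],
        "R-squared": [0.847523, 0.894188, 0.911795, 0.843114, 0.917003, 0.901175],
    },
    index=["Decision Tree", "Bagging", "Random Forest",
           "AdaBoost", "Gradient Boost", "XG Boost"],
)

# Errors: lower is better; fit: higher is better
best_per_metric = pd.concat(
    [comp[["RMSE", "MAE"]].idxmin(), comp[["R-squared"]].idxmax()]
)
print(best_per_metric)  # Gradient Boost wins all three
worst_rmse = comp["RMSE"].idxmax()
print(worst_rmse)  # AdaBoost has the highest RMSE
```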
gb_tuned = GradientBoostingRegressor(random_state = 1)
gb_parameters = {
'n_estimators': [100, 110, 120], # include default 100 (so it's not skipped), but also search other values
'max_depth': [None, 5, 7], # "None" is default, must include
'max_features': [0.8, 1.]
}
from sklearn.model_selection import GridSearchCV # for the hyperparameter search
gb_grid_obj = GridSearchCV(gb_tuned, gb_parameters, scoring = 'neg_mean_squared_error', cv = 5) # using built-in 'neg_mean_squared_error'
gb_grid_obj.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=GradientBoostingRegressor(random_state=1),
             param_grid={'max_depth': [None, 5, 7], 'max_features': [0.8, 1.0],
                         'n_estimators': [100, 110, 120]},
             scoring='neg_mean_squared_error')
gb_tuned_regressor = gb_grid_obj.best_estimator_
gb_tuned_regressor.fit(X_train, y_train)
GradientBoostingRegressor(max_depth=5, max_features=0.8, n_estimators=120,
                          random_state=1)
gb_tuned_regressor_perf_test = model_performance_regression(gb_tuned_regressor, X_test, y_test)
gb_tuned_regressor_perf_test
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.173624 | 0.116198 | 0.923296 | 0.890009 | 5.576311 |
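GridSearchCV also records which parameter combination won and its cross-validated score (against the real search above, the calls would be gb_grid_obj.best_params_ and gb_grid_obj.best_score_). A self-contained sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Tiny synthetic stand-in for X_train / y_train
rng = np.random.RandomState(1)
X_demo = rng.rand(80, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=80)

demo_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    {"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# Winning combination and its CV score (negated MSE, so closer to 0 is better)
print(demo_grid.best_params_)
print(demo_grid.best_score_)
```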
pd.concat(
[
models_test_comp_df,
gb_tuned_regressor_perf_test
],
)
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| Decision Tree | 0.244795 | 0.162963 | 0.847523 | 0.781353 | 8.129842 |
| Bagging | 0.203924 | 0.138520 | 0.894188 | 0.848269 | 6.608667 |
| Random Forest | 0.186186 | 0.126961 | 0.911795 | 0.873517 | 6.141731 |
| AdaBoost | 0.248308 | 0.196700 | 0.843114 | 0.775032 | 9.375605 |
| Gradient Boost | 0.180605 | 0.121228 | 0.917003 | 0.880986 | 5.789599 |
| XG Boost | 0.197076 | 0.127968 | 0.901175 | 0.858288 | 6.108154 |
| 0 | 0.173624 | 0.116198 | 0.923296 | 0.890009 | 5.576311 |
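The unlabeled 0 row in the concatenated table can be named by renaming the tuned model's default integer index before concatenating. A minimal sketch with stand-in one-row DataFrames:

```python
import pandas as pd

# Stand-ins: the comparison table (named rows) and the tuned model's
# one-row performance frame, which carries the default integer index 0
comp = pd.DataFrame({"RMSE": [0.180605]}, index=["Gradient Boost"])
tuned_perf = pd.DataFrame({"RMSE": [0.173624]})

# Rename the 0 row before concatenating so the final table is fully labeled
combined = pd.concat([comp, tuned_perf.rename(index={0: "Tuned Gradient Boost"})])
print(combined)
```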
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.metrics import get_scorer
# Tuned GB Learning Curve - R-squared
# use the R-squared with 10 folds (cv)
sizes, train_scores, test_scores = learning_curve(gb_tuned_regressor, X_train, y_train, cv=10, scoring='r2')
train_mean = np.mean(train_scores, axis=1) # mean of the training r-sq scores
train_std = np.std(train_scores, axis=1) # standard deviation of the training r-sq scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
### Plotting the Training & Test score graphs
# Training Data (blue)
# 'sizes': how many rows to use for the score calculation
# 'marker': shape at each x-value ('o' = dot & 's' = square)
plt.plot(sizes, train_mean, color="blue", marker='o', markersize=5, label="Training Data score")
# fill in between the mean +- one standard deviation
# 'alpha' = opacity (lower = more transparent)
plt.fill_between(sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color="blue")
# Test Data (green)
plt.plot(sizes, test_mean, color="green", linestyle='--', marker='s', markersize=5, label="Test Data score")
plt.fill_between(sizes, test_mean - test_std, test_mean + test_std, alpha=0.15, color="green")
###
plt.title("Tuned GB Learning Curve - R-squared")
plt.xlabel("Training Set Size")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()
# Check Usable Scorers (the positive versions of the error functions are not available)
from sklearn import metrics
metrics.get_scorer_names()
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted', 'matthews_corrcoef', 'max_error', 'mutual_info_score', 'neg_brier_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_gamma_deviance', 'neg_mean_poisson_deviance', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'neg_negative_likelihood_ratio', 'neg_root_mean_squared_error', 'normalized_mutual_info_score', 'positive_likelihood_ratio', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'rand_score', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'roc_auc_ovo', 'roc_auc_ovo_weighted', 'roc_auc_ovr', 'roc_auc_ovr_weighted', 'top_k_accuracy', 'v_measure_score']
# Tuned GB Learning Curve - Negative RMSE
# use the Negative RMSE with 10 folds (cv)
sizes, train_scores, test_scores = learning_curve(gb_tuned_regressor, X_train, y_train, cv=10, scoring='neg_root_mean_squared_error')
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(sizes, train_mean, color="blue", marker='o', markersize=5, label="Training Data score")
plt.fill_between(sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color="blue")
plt.plot(sizes, test_mean, color="green", linestyle='--', marker='s', markersize=5, label="Test Data score")
plt.fill_between(sizes, test_mean - test_std, test_mean + test_std, alpha=0.15, color="green")
plt.title("Tuned GB Learning Curve - Negative RMSE")
plt.xlabel("Training Set Size")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()
# Tuned GB Learning Curve - Negative MAPE
# use the Negative MAPE with 10 folds (cv)
sizes, train_scores, test_scores = learning_curve(gb_tuned_regressor, X_train, y_train, cv=10, scoring='neg_mean_absolute_percentage_error')
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(sizes, train_mean, color="blue", marker='o', markersize=5, label="Training Data score")
plt.fill_between(sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color="blue")
plt.plot(sizes, test_mean, color="green", linestyle='--', marker='s', markersize=5, label="Test Data score")
plt.fill_between(sizes, test_mean - test_std, test_mean + test_std, alpha=0.15, color="green")
plt.title("Tuned GB Learning Curve - Negative MAPE")
plt.xlabel("Training Set Size")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()
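The three learning-curve cells above differ only in the scorer and the title, so they can be collapsed into one helper. A sketch (the function name plot_gb_learning_curve is my own; it assumes the same cv=10 setup used above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_gb_learning_curve(estimator, X, y, scoring, title, cv=10):
    """Plot mean +/- 1 std training & test curves for the given scorer."""
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, scoring=scoring
    )
    for scores, color, style, marker, label in [
        (train_scores, "blue", "-", "o", "Training Data score"),
        (test_scores, "green", "--", "s", "Test Data score"),
    ]:
        mean, std = scores.mean(axis=1), scores.std(axis=1)
        plt.plot(sizes, mean, color=color, linestyle=style,
                 marker=marker, markersize=5, label=label)
        plt.fill_between(sizes, mean - std, mean + std, alpha=0.15, color=color)
    plt.title(title)
    plt.xlabel("Training Set Size")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.show()

# e.g.:
# plot_gb_learning_curve(gb_tuned_regressor, X_train, y_train,
#                        "r2", "Tuned GB Learning Curve - R-squared")
```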
Feature Importance
# Plotting a Horizontal Bar Chart showing importances of features
importances = gb_tuned_regressor.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances with Tuned Gradient Boost')
plt.barh(range(len(features)), importances[indices])
plt.yticks(range(len(features)), [features[i] for i in indices]);
Importances Observations
Clearly, 'Power' is far and away the most important feature (roughly 5x the next feature).
'Year' carries some importance, and the rest have hardly any impact.
In contrast to the Decision Tree, 'Seats' sits near the middle of the pack (with essentially no weight), whereas the DT placed it at the top as the root node.
Also, it is important to keep in mind that these importances are only magnitudes (they are always positive), so they do not tell us the direction of each feature's effect.
Multicollinearity: since the tuned Gradient Boost is a non-linear model, we cannot directly check VIF. But since the strongest feature ('Power') has no strong correlation with any remaining feature (its only strong correlation was with 'Engine', which we dropped), multicollinearity should not be a concern for the Gradient Boost.
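One way to sanity-check impurity-based importances is permutation importance, which measures how much the score drops when a feature's values are shuffled. A self-contained sketch on synthetic data (against the real model, the call would pass gb_tuned_regressor with X_test and y_test):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: feature 0 dominates (like 'Power'), feature 2 is pure noise
rng = np.random.RandomState(1)
X_demo = rng.rand(200, 3)
y_demo = 5 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.05, size=200)

model = GradientBoostingRegressor(random_state=1).fit(X_demo, y_demo)

# Shuffle each column 10 times and record the average drop in R-squared
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=1)
print(result.importances_mean)  # feature 0 should dominate
```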
# Construct the equation
equation = ''
for feature, coef in zip(features, importances):
if coef != 0:
sign = '+' if coef > 0 else '-'
equation += f'{coef:.3f} * {feature} {sign} '
# Simplify the equation (replace any addition of a negative with subtraction):
equation = equation.replace(' + -', ' - ')
equation = equation.replace('+-', '-')
print(f'Equation: {equation}')
Equation: 0.119 * Year + 0.032 * Kilometers_Driven + 0.038 * Mileage + 0.577 * Power + 0.001 * Seats + 0.000 * Location_Ahmedabad + 0.001 * Location_Bangalore + 0.000 * Location_Chennai + 0.004 * Location_Coimbatore + 0.000 * Location_Delhi + 0.002 * Location_Hyderabad + 0.001 * Location_Jaipur + 0.002 * Location_Kochi + 0.004 * Location_Kolkata + 0.001 * Location_Mumbai + 0.000 * Location_Pune + 0.000 * Fuel_Type_CNG + 0.025 * Fuel_Type_Diesel + 0.031 * Fuel_Type_Petrol + 0.060 * Transmission_Automatic + 0.051 * Transmission_Manual + 0.001 * Owner_Type_First + 0.000 * Owner_Type_Second + 0.000 * Owner_Type_Third + 0.000 * Manufacturer_Audi + 0.000 * Manufacturer_BMW + 0.000 * Manufacturer_Datsun + 0.000 * Manufacturer_Fiat + 0.001 * Manufacturer_Ford + 0.000 * Manufacturer_Honda + 0.005 * Manufacturer_Hyundai + 0.000 * Manufacturer_Isuzu + 0.000 * Manufacturer_Jaguar + 0.000 * Manufacturer_Jeep + 0.001 * Manufacturer_Mahindra + 0.003 * Manufacturer_Maruti + 0.000 * Manufacturer_Mercedes-Benz + 0.021 * Manufacturer_Mini + 0.000 * Manufacturer_Mitsubishi + 0.000 * Manufacturer_Nissan + 0.000 * Manufacturer_Renault + 0.002 * Manufacturer_Skoda + 0.010 * Manufacturer_Tata + 0.002 * Manufacturer_Toyota + 0.001 * Manufacturer_Volkswagen + 0.000 * Manufacturer_Volvo +
# prompt: print power coefficient
power_coef = importances[features.index('Power')]
print(f'Power Coefficient: {power_coef:.3f}')
year_coef = importances[features.index('Year')]
print(f'Year Coefficient: {year_coef:.3f}')
Power Coefficient: 0.577 Year Coefficient: 0.119
# prompt: reverse log for power and year coefficients
import math
power_coef_exp = math.exp(power_coef) # apply the exponential function to the 'Power' coefficient from the feature importances (final model equation)
year_coef_exp = math.exp(year_coef)
print(f'Power Coefficient (unlogged): {power_coef_exp:.3f}') # round to three decimal places
print(f'Year Coefficient (unlogged): {year_coef_exp:.3f}')
Power Coefficient (unlogged): 1.780 Year Coefficient (unlogged): 1.126
# Converting coefficient effect back to INR
power_coef_inr = power_coef_exp * 100000
year_coef_inr = year_coef_exp * 100000
power_coef_inr_formatted = '₹{:,.2f}'.format(power_coef_inr)
year_coef_inr_formatted = '₹{:,.2f}'.format(year_coef_inr)
print('Reminder of the ', bold_text('Average INR Price: '), inr_mean_price_formatted)
print(f'Power Effect (INR): {power_coef_inr_formatted} average increase in predicted price per unit of bhp')
print(f'Year Effect (INR): {year_coef_inr_formatted} average increase in predicted price per year')
Reminder of the Average INR Price: ₹1,058,015.69
Power Effect (INR): ₹178,003.07 average increase in predicted price per unit of bhp
Year Effect (INR): ₹112,648.06 average increase in predicted price per year
# Convert INR --> USD
power_coef_usd = power_coef_inr * 0.012
year_coef_usd = year_coef_inr * 0.012
power_coef_usd_formatted = '${:,.2f}'.format(power_coef_usd)
year_coef_usd_formatted = '${:,.2f}'.format(year_coef_usd)
print('Reminder of the ', bold_text('Average USD Price: '), usd_mean_price_formatted)
print(f'Power Effect (USD): {power_coef_usd_formatted} average increase in predicted price per unit of bhp')
print(f'Year Effect (USD): {year_coef_usd_formatted} average increase in predicted price per year')
Reminder of the Average USD Price: $12,696.19
Power Effect (USD): $2,136.04 average increase in predicted price per unit of bhp
Year Effect (USD): $1,351.78 average increase in predicted price per year
# Apply the exponential function to errors individually (for clarity in comparison)
# Test Data
rmse_gb = math.exp(gb_tuned_regressor_perf_test['RMSE'])
mae_gb = math.exp(gb_tuned_regressor_perf_test['MAE'])
# Print the results
print(bold_text("Tuned Gradient Boost Error (reverse logged):"))
print("RMSE:", round(rmse_gb, 3))
print("MAE:", round(mae_gb, 3))
Tuned Gradient Boost Error (reverse logged):
RMSE: 1.19
MAE: 1.123
# Convert the above to INR & USD
inr_error_gb_rmse = rmse_gb * 100000 # converting based on the price variable definition
inr_error_gb_mae = mae_gb * 100000
usd_error_gb_rmse = inr_error_gb_rmse * 0.012 # use currency conversion of 1 INR = 0.012 USD
usd_error_gb_mae = inr_error_gb_mae * 0.012
# Format the amounts for money
inr_error_gb_rmse_formatted = '₹{:,.2f}'.format(inr_error_gb_rmse)
inr_error_gb_mae_formatted = '₹{:,.2f}'.format(inr_error_gb_mae)
usd_error_gb_rmse_formatted = '${:,.2f}'.format(usd_error_gb_rmse)
usd_error_gb_mae_formatted = '${:,.2f}'.format(usd_error_gb_mae)
##### PRINT the converted results
print(bold_text("Tuned Gradient Boost Error (in money)"))
print("RMSE (INR): ", inr_error_gb_rmse_formatted)
print("MAE (INR): ", inr_error_gb_mae_formatted)
print('-' * 10)
print("RMSE (USD): ", usd_error_gb_rmse_formatted)
print("MAE (USD): ", usd_error_gb_mae_formatted)
Tuned Gradient Boost Error (in money)
RMSE (INR): ₹118,960.78
MAE (INR): ₹112,321.80
----------
RMSE (USD): $1,427.53
MAE (USD): $1,347.86
Error Conversion Observation (pt. 1 - raw money)
# INR mean price
inr_mean_price = data['Price'].mean() * 100000
inr_mean_price_formatted = '₹{:,.2f}'.format(inr_mean_price)
print('Average ', bold_text('INR Price: '), inr_mean_price_formatted)
# USD mean price
usd_mean_price = inr_mean_price * 0.012
usd_mean_price_formatted = '${:,.2f}'.format(usd_mean_price)
print('Average ', bold_text('USD Price: '), usd_mean_price_formatted)
Average INR Price: ₹1,058,015.69 Average USD Price: $12,696.19
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
3. Proposal for the final solution design:
After all the linear and non-linear models, we can conclude that:
The best performing linear model was Least Squares, with consistently the lowest errors & best fits.
Hyperparameter tuning showed that the best non-linear (& overall) model was the Tuned Gradient Boost, with the best metrics of all fourteen models (lowest overall errors & highest overall fits).
Of course, as always, more data can improve the models. That is especially true here: one column ('New_Price') had over 86% missing values, so it was dropped to keep it from interfering with the machine learning models. Intuitively, though, it should be an important factor in pricing a used car (i.e. a car should almost always depreciate in value over time, unless it becomes a rare collector's car).
As for further tuning, because we did see an overfit model (the training data significantly outperformed the test data), possible ways to address this could be:
As always, null values and outliers can seriously impact data modeling. I spent significant time trying to impute some null values (in particular, researching MPGe equivalencies for the electric cars and converting to kmpl; in one case, I performed additional calculations using the range & battery size since the statistic was not advertised). But when I later removed outliers, those rows fell out of the dataset. Because I had manually entered those mileage values by row index, my data then gained two new rows that were all null except for that mileage data. I eventually tracked this down and removed the code that manually imputed the mileage.
As previously mentioned, the Tuned Gradient Boost scored the best in every category. With the current feature importances ('Power' = 0.577 & 'Year' = 0.119), this means for Cars4U:
In 2018-19, Used Cars outsold New Cars by a margin of 4 million to 3.6 million. While new car prices are set by the manufacturer without flexibility, used cars have many factors affecting their value (how much they've been driven, how old they are, etc.). The trend of people buying used cars continues to rise, even replacing their current car with another used instead of opting for a new car. So a company like Cars4U needs to be able to accurately determine a used car's value in order to effectively price it so that it will sell and they will still earn a profit.
We are trying to develop a machine learning regression model that can use the available data to predict prices as accurately as possible (i.e. least amount of error and the best level of fit).
A few key questions that should direct our work when analyzing the data, building, training and tuning our models might be:
What kind of model is most effective in producing an accurate pricing function? How accurate will that model be? --> Tuned Gradient Boost model with about a $1,400 error rate and 92% price fit.
Which features have the biggest effect on the price of a used car? --> As we noted, 'Power' & 'Year' were the most important features. For each unit increase in each, the car's value increased by about \$2,136 & \$1,352, respectively.
How does Cars4U actually use that model to drive their business? --> With the above information, the company should focus on trying to acquire more powerful cars that are newer, since those will have the biggest positive impact on the sale prices.
The particular branch of data science that we're using here is machine learning, which entails building regression models based on given data to help determine the numeric output based on a range of features (both numeric and categorical). In this case, we are using data on used car sales in India during 2018-19 that has features such as the model name, year, engine size & power, etc. as well as their new prices when they were released and their current values.
Unfortunately, nearly all of the new price data is missing, as is a good chunk of the current price data. This means we have to look to other features to estimate a used car's value from the information we do have. By focusing on cars with complete rows (no missing values), we can train different linear and nonlinear models to find the one with the best performance metrics, then tune hyperparameters to further improve that performance. Finally, we will discover the importances (weights) of each feature so we can translate them back to the context of the problem (namely, how much a change in each feature affects the value of a car).
After training six linear and six nonlinear models (as well as refining a dataset that originally had many blank values), we found that a Tuned Gradient Boost model yielded the most accurate results (lowest errors and best fits) when trying to predict the current value of a used car given data from 2018-19 sales in India. With a 92% fit & an error rate of approximately $1,400 (difference between predicted and actual used car values), this model shows promise for use by Cars4U in determining listing prices of cars when selling. We also found that the impact of 'Power' (the most important feature as determined by our training) was ~\$2,100 increase in value per unit of bhp, and each newer year for the car saw an increase of ~\$1,400 per year.
However, we need to keep in mind that the performance of these models was calculated considering the vast amount of null (blank values) especially in cars' new prices when they were originally listed, as well as many current prices also missing (over 1,200). Before implementing this model full-throttle, it would behoove Cars4U to try and fill in as many of these missing prices (and other missing data) as possible using dealer/manufacturer information that should be available.
Used cars have outsold new cars in India recently, and the trend seems to be growing. In 2018-19, 4 million used cars were sold, while only 3.6 million new cars were sold. In fact, people are also replacing their current cars with used cars instead of new cars. And while the price of a new car is set by the manufacturer without any hassle, used cars are a different story and have many factors that affect their perceived value (the year, the make, the miles or kilometers on the odometer, etc.), which makes it difficult for a startup like Cars4U to accurately predict the price of a used car when trying to sell for a profit. A car priced too low would likely sell easily, but then the seller might not be making as much as they could. And if you price it too high to try and make more profit, it might not sell at all.
While a gradient boost model is not simple to interpret precisely (it uses a collection of decision trees, keeps building more based on the slope of a loss function, and assigns weights to the subsequent trees... none of which is explicitly shown in terms of exactly how the trees are formed and aggregated), we can interpret the results with a little context. In our case, the model (when tuned to improve its performance) was able to fit about 92% of the data for current prices within about a \$1,400 range. Given that the average used car price here (after removing outliers) was about \$12,700, that's not too bad. The above notebook shows detailed analyses of the numerical & categorical data along with explanations of what steps were taken to prepare the data for processing, as well as how to interpret the results & charts/tables.
In general, gradient boost models are quite popular because of their effectiveness in contexts and objectives like these, thanks to the "strong ensemble" idea... two heads are better than one, after all. That said, there was quite a bit of overfit at the end that should not be ignored, since the model performed significantly better on the training data than the test data. Initial attempts to address this (experimenting with hyperparameters) did not show improvement and caused an extreme increase in execution time (ex: I disconnected one attempt after it spent more than five minutes computing one learning curve). The least squares linear model did not suffer from this execution-time increase, but its performance metrics were weaker to begin with, so it was not kept in this implementation.
As it stands now, we can expect roughly a \$2,100 increase in car value per unit of horsepower and about \$1,400 increase for each more recent year the car was made in (ex: if a 2016 model with 100 horsepower was valued at \$10,000 then a 2017 model of similar features but with 101 horsepower should have a value around \$10,000 + \$2,100 + \$1,400 = $13,500).
As previously mentioned, the main issues concerning the current model are:
Excessive null values (over 6,000 in new price & 1,200 in current price)
Abundance of high-end cars (serving as outliers) -- brands such as Lamborghini and Porsche likely have dedicated fans with extra insight into how to price those cars, who would not need/use a more generalized pricing model like the one built here. Plus, the target client of this model (hypothetical Cars4U) is described as a startup, so they probably would not be too concerned with trying to acquire & sell those types of high-end cars.
The overfit in the final result, where the test data performs below the training data. Tuning hyperparameters such as how many samples to use within each tree, how many folds to use during cross-validation, which type of search for ideal values (Random vs. Grid), or the ratio used to split the training vs. test data could all help to further improve the performance metrics (errors & fit) and close the learning curve gaps (overfitting).
I would recommend that the primary focus for improving this pricing model be gathering more data for the existing cars in the dataset to improve accuracy. Gathering more data is often an easy answer to this type of question, but it is especially important when so much is missing. And here, finding how much cars originally cost when they were released should be a pretty straightforward task since, as we mentioned, those new prices are set by the manufacturer. As for finding current values of those cars, there should be sales records of similar cars to fill in the gaps.
If that's not feasible for whatever reason, then replace cars for which accurate pricing data is not available with other car sales that do have sufficient data. I would also suggest trying to find more cars from manufacturers that are under-represented here but certainly still popular (ex: Fiat & Isuzu have plenty of ads and models on the road, but they are hardly represented in this dataset; their inclusion could really help the model's accuracy).